1
Klauschen F, Dippel J, Keyl P, Jurmeister P, Bockmayr M, Mock A, Buchstab O, Alber M, Ruff L, Montavon G, Müller KR. Toward Explainable Artificial Intelligence for Precision Pathology. ANNUAL REVIEW OF PATHOLOGY 2024; 19:541-570. [PMID: 37871132 DOI: 10.1146/annurev-pathmechdis-051222-113147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
The rapid development of precision medicine in recent years has started to challenge diagnostic pathology with respect to its ability to analyze histological images and increasingly large molecular profiling data in a quantitative, integrative, and standardized way. Artificial intelligence (AI) and, more precisely, deep learning technologies have recently demonstrated the potential to facilitate complex data analysis tasks, including clinical, histological, and molecular data for disease classification; tissue biomarker quantification; and clinical outcome prediction. This review provides a general introduction to AI and describes recent developments with a focus on applications in diagnostic pathology and beyond. We explain limitations including the black-box character of conventional AI and describe solutions to make machine learning decisions more transparent with so-called explainable AI. The purpose of the review is to foster a mutual understanding of both the biomedical and the AI side. To that end, in addition to providing an overview of the relevant foundations in pathology and machine learning, we present worked-through examples for a better practical understanding of what AI can achieve and how it should be done.
Affiliation(s)
- Frederick Klauschen
  - Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany
  - Institute of Pathology, Charité Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
  - German Cancer Consortium, German Cancer Research Center (DKTK/DKFZ), Munich Partner Site, Munich, Germany
- Jonas Dippel
  - Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
  - Machine Learning Group, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
- Philipp Keyl
  - Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany
- Philipp Jurmeister
  - Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany
  - German Cancer Consortium, German Cancer Research Center (DKTK/DKFZ), Munich Partner Site, Munich, Germany
- Michael Bockmayr
  - Institute of Pathology, Charité Universitätsmedizin Berlin, Berlin, Germany
  - Department of Pediatric Hematology and Oncology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Research Institute Children's Cancer Center Hamburg, Hamburg, Germany
- Andreas Mock
  - Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany
  - German Cancer Consortium, German Cancer Research Center (DKTK/DKFZ), Munich Partner Site, Munich, Germany
- Oliver Buchstab
  - Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany
- Maximilian Alber
  - Institute of Pathology, Charité Universitätsmedizin Berlin, Berlin, Germany
  - Aignostics, Berlin, Germany
- Grégoire Montavon
  - Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
  - Machine Learning Group, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Klaus-Robert Müller
  - Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
  - Machine Learning Group, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
  - Department of Artificial Intelligence, Korea University, Seoul, Korea
  - Max Planck Institute for Informatics, Saarbrücken, Germany
2
Nordin NI, Mustafa WA, Lola MS, Madi EN, Kamil AA, Nasution MD, K. Abdul Hamid AA, Zainuddin NH, Aruchunan E, Abdullah MT. Enhancing COVID-19 Classification Accuracy with a Hybrid SVM-LR Model. Bioengineering (Basel) 2023; 10:1318. [PMID: 38002441 PMCID: PMC10669812 DOI: 10.3390/bioengineering10111318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 10/03/2023] [Accepted: 10/09/2023] [Indexed: 11/26/2023] Open
Abstract
Support vector machine (SVM) is a newer machine learning algorithm for classification, while logistic regression (LR) is an older statistical classification method. Despite the numerous studies contrasting SVM and LR, new improvements such as bagging and ensemble have been applied to them since these comparisons were made. This study proposes a new hybrid model based on SVM and LR for predicting small events per variable (EPV). The performance of the hybrid, SVM, and LR models with different EPV values was evaluated using COVID-19 data from December 2019 to May 2020 provided by the WHO. The study found that the hybrid model had better classification performance than SVM and LR in terms of accuracy, mean squared error (MSE), and root mean squared error (RMSE) for different EPV values. This hybrid model is particularly important for medical authorities and practitioners working in the face of future pandemics.
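The quantities named in this abstract (EPV, accuracy, MSE, RMSE) can be illustrated with a minimal sketch. The abstract does not describe how the hybrid SVM-LR model is built, so the code below covers only the evaluation side. The function names and the toy labels are hypothetical, and "events" is assumed here to mean the number of samples with label 1.

```python
import math
from typing import Sequence


def events_per_variable(labels: Sequence[int], n_predictors: int) -> float:
    """EPV: outcome events (assumed: samples labelled 1) per predictor variable."""
    return sum(labels) / n_predictors


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean squared error between true and predicted labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical toy data: 8 samples, 2 of them misclassified.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(events_per_variable(y_true, n_predictors=2))
print(accuracy(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

With hard 0/1 predictions, MSE is simply the misclassification rate (1 minus accuracy), so the three metrics only carry independent information when the model outputs probabilities rather than class labels.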
Affiliation(s)
- Noor Ilanie Nordin
  - Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Kuala Nerus 21030, Terengganu, Malaysia
  - Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Kelantan, Bukit Ilmu, Machang 18500, Kelantan, Malaysia
- Wan Azani Mustafa
  - Faculty of Electrical Engineering & Technology, Pauh Putra Campus, Universiti Malaysia Perlis (UniMAP), Arau 02600, Perlis, Malaysia
  - Centre of Excellence for Advanced Computing, Pauh Putra Campus, Universiti Malaysia Perlis (UniMAP), Arau 02600, Perlis, Malaysia
- Muhamad Safiih Lola
  - Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Kuala Nerus 21030, Terengganu, Malaysia
  - Special Interest Group on Modeling and Data Analytics (SIGMDA), Universiti Malaysia Terengganu, Kuala Nerus 21030, Terengganu, Malaysia
- Elissa Nadia Madi
  - Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin (UniSZA), Besut Campus, Besut 22200, Terengganu, Malaysia
- Anton Abdulbasah Kamil
  - Faculty of Economics, Administrative and Social Sciences, Istanbul Gelisim University, Cihangir Mah. Şehit Jandarma Komando Er Hakan Öner Sk. No:1 Avcılar, İstanbul 34310, Turkey
- Marah Doly Nasution
  - Faculty of Teacher and Education, University Muhammadiyah Sumatera Utara, Jl. Kapten Muchtar Basri No.3, Glugur Darat II, Kec. Medan Tim., Kota Medan 20238, Sumatera Utara, Indonesia
- Abdul Aziz K. Abdul Hamid
  - Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Kuala Nerus 21030, Terengganu, Malaysia
  - Special Interest Group on Applied Informatics and Intelligent Applications (AINIA), Universiti Malaysia Terengganu, Kuala Nerus 21030, Terengganu, Malaysia
- Nurul Hila Zainuddin
  - Mathematics Department, Faculty of Science and Mathematics, Universiti Pendidikan Sultan Idris, Tanjong Malim 53900, Perak Darul Ridzuan, Malaysia
- Elayaraja Aruchunan
  - Department of Decision Science, Faculty of Business and Economics, University Malaya, Kuala Lumpur 50603, Malaysia
- Mohd Tajuddin Abdullah
  - Fellow Academy of Sciences Malaysia, Level 20, West Wing Tingkat 20, Menara MATRADE, Jalan Sultan Haji Ahmad Shah, Kuala Lumpur 50480, Malaysia
3
Ditz JC, Reuter B, Pfeifer N. Inherently interpretable position-aware convolutional motif kernel networks for biological sequencing data. Sci Rep 2023; 13:17216. [PMID: 37821530 PMCID: PMC10567796 DOI: 10.1038/s41598-023-44175-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 10/04/2023] [Indexed: 10/13/2023] Open
Abstract
Artificial neural networks show promising performance in detecting correlations within data that are associated with specific outcomes. However, the black-box nature of such models can hinder the advancement of knowledge in research fields by obscuring the decision process and preventing scientists from fully conceptualizing predicted outcomes. Furthermore, domain experts like healthcare providers need explainable predictions to assess whether a predicted outcome can be trusted in high-stakes scenarios and to help them integrate a model into their own routine. Therefore, interpretable models play a crucial role in the incorporation of machine learning into high-stakes scenarios like healthcare. In this paper we introduce Convolutional Motif Kernel Networks, a neural network architecture that involves learning a feature representation within a subspace of the reproducing kernel Hilbert space of the position-aware motif kernel function. The resulting model makes it possible to directly interpret and evaluate prediction outcomes by providing a biologically and medically meaningful explanation without the need for additional post-hoc analysis. We show that our model is able to robustly learn on small datasets and reaches state-of-the-art performance on relevant healthcare prediction tasks. Our proposed method can be utilized on DNA and protein sequences. Furthermore, we show that the proposed method learns biologically meaningful concepts directly from data using an end-to-end learning scheme.
Affiliation(s)
- Jonas C Ditz
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
- Bernhard Reuter
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
- Nico Pfeifer
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
4
Barbero-Aparicio JA, Cuesta-Lopez S, García-Osorio CI, Pérez-Rodríguez J, García-Pedrajas N. Nonlinear physics opens a new paradigm for accurate transcription start site prediction. BMC Bioinformatics 2022; 23:565. [PMID: 36585618 PMCID: PMC9801560 DOI: 10.1186/s12859-022-05129-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Accepted: 12/27/2022] [Indexed: 12/31/2022] Open
Abstract
There is evidence that DNA breathing (spontaneous opening of the DNA strands) plays a relevant role in the interactions of DNA with other molecules, and in particular in the transcription process. Therefore, having physical models that can predict these openings is of interest. However, this source of information has not been used before either in transcription start sites (TSSs) or promoter prediction. In this article, one such model is used as an additional information source that, when used by a machine learning (ML) model, improves the results of current methods for the prediction of TSSs. In addition, we provide evidence on the validity of the physical model, as it is able by itself to predict TSSs with high accuracy. This opens an exciting avenue of research at the intersection of statistical mechanics and ML, where ML models in bioinformatics can be improved using physical models of DNA as feature extractors.
Affiliation(s)
- José Antonio Barbero-Aparicio
  - Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain
- Santiago Cuesta-Lopez
  - Universidad de Burgos, Hospital del Rey, s/n, 09001 Burgos, Spain
  - ICAMCyL Foundation, International Center for Advanced Materials and Raw Materials of Castilla y León, León Technology Park, main building, first floor, offices 106-108, C/Julia Morros s/n, Armunia, 24009 León, Spain
- César Ignacio García-Osorio
  - Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain
- Javier Pérez-Rodríguez
  - Departamento de Métodos Cuantitativos, Universidad de Loyola Andalucía, Escritor Castilla Aguayo, 4, 14004 Córdoba, Spain
- Nicolás García-Pedrajas
  - Department of Computing and Numerical Analysis, University of Córdoba, Edificio Albert Einstein, Campus de Rabanales, 14071 Córdoba, Spain
5
Eberle O, Buttner J, Krautli F, Muller KR, Valleriani M, Montavon G. Building and Interpreting Deep Similarity Models. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022; 44:1149-1161. [PMID: 32870784 DOI: 10.1109/tpami.2020.3020738] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Many learning algorithms such as kernel machines, nearest neighbors, clustering, or anomaly detection, are based on distances or similarities. Before similarities are used for training an actual machine learning model, we would like to verify that they are bound to meaningful patterns in the data. In this paper, we propose to make similarities interpretable by augmenting them with an explanation. We develop BiLRP, a scalable and theoretically founded method to systematically decompose the output of an already trained deep similarity model on pairs of input features. Our method can be expressed as a composition of LRP explanations, which were shown in previous works to scale to highly nonlinear models. Through an extensive set of experiments, we demonstrate that BiLRP robustly explains complex similarity models, e.g., built on VGG-16 deep neural network features. Additionally, we apply our method to an open problem in digital humanities: detailed assessment of similarity between historical documents, such as astronomical tables. Here again, BiLRP provides insight and brings verifiability into a highly engineered and problem-specific similarity model.
6
Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022; 16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022] Open
Abstract
Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomics signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work in application of machine learning in the recognition of genomic signals and regions in the human genome and summarize some lessons learnt from this endeavor. There is no doubt that significant progress has been made both in terms of accuracy and reliability of models. Questions remain, however, as to whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in application of machine learning and more generally of artificial intelligence methods in genomics.
Affiliation(s)
- Boris Jankovic
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Takashi Gojobori
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Division of Biological and Environmental Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
7
Vinayagam A, Othman ML, Veerasamy V, Saravan Balaji S, Ramaiyan K, Radhakrishnan P, Raman MD, Abdul Wahab NI. A random subspace ensemble classification model for discrimination of power quality events in solar PV microgrid power network. PLoS One 2022; 17:e0262570. [PMID: 35085307 PMCID: PMC8794120 DOI: 10.1371/journal.pone.0262570] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/22/2021] [Accepted: 12/29/2021] [Indexed: 11/18/2022] Open
Abstract
This study proposes an SVM-based Random Subspace (RS) ensemble classifier to discriminate different Power Quality Events (PQEs) in a photovoltaic (PV) connected Microgrid (MG) model. The MG model is developed and simulated with the presence of different PQEs (voltage and harmonic related signals and distinctive transients) in both on-grid and off-grid modes of the MG network. In the pre-stage of classification, features are extracted from numerous PQE signals by Discrete Wavelet Transform (DWT) analysis, and the extracted features are used to train the classifiers at the final stage. In this study, three kernel types of SVM classifiers (Linear, Quadratic, and Cubic) are first used to predict the different PQEs. Among these, the Cubic kernel SVM classifier offers higher accuracy and better performance than the other kernel types (Linear and Quadratic). Further, to enhance the accuracy of SVM classifiers, an SVM-based RS ensemble model is proposed and its effectiveness is verified against the results of kernel-based SVM classifiers under the standard test condition (STC) and varying solar irradiance of PV in real time. From the final results, it can be concluded that the proposed method is more robust and offers superior performance with higher classification accuracy than kernel-based SVM classifiers.
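The random subspace idea described in this abstract can be sketched as follows: each ensemble member is trained on a random subset of the features, and the members vote. This is a minimal illustration, not the authors' implementation. To keep it dependency-free, a nearest-centroid base classifier is swapped in for the SVM, and the feature values and the event labels ("normal", "sag") are hypothetical stand-ins for DWT-derived features.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Centroids = Dict[str, List[float]]
Model = Tuple[List[int], Centroids]  # (feature indices used, class centroids)


def fit_centroids(X: Sequence[Sequence[float]], y: Sequence[str],
                  feats: List[int]) -> Centroids:
    """Base learner (stand-in for an SVM): class means on the chosen features."""
    counts = Counter(y)
    sums: Dict[str, List[float]] = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += row[f]
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(centroids: Centroids, row: Sequence[float],
                     feats: List[int]) -> str:
    """Assign the class whose centroid is closest in the feature subspace."""
    def dist(center: List[float]) -> float:
        return sum((row[f] - center[j]) ** 2 for j, f in enumerate(feats))
    return min(centroids, key=lambda label: dist(centroids[label]))


def fit_random_subspace(X: Sequence[Sequence[float]], y: Sequence[str],
                        n_models: int, subspace_size: int,
                        seed: int = 0) -> List[Model]:
    """Train each member on its own random subset of feature indices."""
    rng = random.Random(seed)
    n_features = len(X[0])
    models: List[Model] = []
    for _ in range(n_models):
        feats = sorted(rng.sample(range(n_features), subspace_size))
        models.append((feats, fit_centroids(X, y, feats)))
    return models


def predict_ensemble(models: List[Model], row: Sequence[float]) -> str:
    """Majority vote over the ensemble members."""
    votes = Counter(predict_centroid(c, row, feats) for feats, c in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy features for two event classes.
X = [[0, 0, 1, 0], [1, 0, 0, 1], [9, 10, 9, 10], [10, 9, 10, 9]]
y = ["normal", "normal", "sag", "sag"]
models = fit_random_subspace(X, y, n_models=5, subspace_size=2)
print(predict_ensemble(models, [0.5, 0.5, 0.5, 0.5]))
print(predict_ensemble(models, [9.5, 9.5, 9.5, 9.5]))
```

Because every member sees a different feature subset, the members make partly independent errors, which is what the vote exploits. With scikit-learn available, a similar construction is commonly written by wrapping an `SVC` in a `BaggingClassifier` that subsamples features instead of samples; that mapping is an assumption about tooling, not something the abstract states.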
Affiliation(s)
- Arangarajan Vinayagam
  - Department of Electrical and Electronics Engineering, New Horizon College of Engineering, Bangalore, India
- Mohammad Lutfi Othman
  - Advanced Lightning, Power and Energy Research (ALPER), Department of Electrical and Electronics Engineering, Universiti Putra Malaysia (UPM), Selangor, Malaysia
- Veerapandiyan Veerasamy
  - School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
- Suganthi Saravan Balaji
  - Department of Information Technology, College of Engineering and Computer Science, Lebanese French University, Erbil, Kurdistan Region, Iraq
- Kalaivani Ramaiyan
  - Department of Electrical and Electronics Engineering, Rajalakshmi Engineering College, Chennai, India
- Padmavathi Radhakrishnan
  - Department of Electrical and Electronics Engineering, Rajalakshmi Engineering College, Chennai, India
- Mohan Das Raman
  - Department of Electrical and Electronics Engineering, New Horizon College of Engineering, Bangalore, India
- Noor Izzri Abdul Wahab
  - Advanced Lightning, Power and Energy Research (ALPER), Department of Electrical and Electronics Engineering, Universiti Putra Malaysia (UPM), Selangor, Malaysia
8
Perez-Rodriguez J, de Haro-Garcia A, Garcia-Pedrajas N. Floating Search Methodology for Combining Classification Models for Site Recognition in DNA Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2471-2482. [PMID: 32078558 DOI: 10.1109/tcbb.2020.2974221] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. The best approaches use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary as it is unlikely that a single classifier can solve this problem with the best possible performance. A major issue is that the number of possible models to combine is large and the use of all of these models is impractical. In this paper we present a methodology for combining many sources of information to recognize any functional site using "floating search", a powerful heuristic applicable when the cost of evaluating each solution is high. We present experiments on four functional sites in the human genome, which is used as the target genome, and use another 20 species as sources of evidence. The proposed methodology shows significant improvement over state-of-the-art methods. The results show an advantage of the proposed method and also challenge the standard assumption of using only genomes not very close and not very far from the human to improve the recognition of functional sites.
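A generic form of floating search can be sketched as follows: a forward step adds the candidate that helps most, then backward steps drop members for as long as that yields a better subset of the smaller size than any seen so far. This is a sketch of the general heuristic under assumed names, not the procedure used in the paper. The candidate models ("A" to "D") and the integer scoring function are hypothetical; in practice the score would be the validation performance of the combined classifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def floating_search(candidates: Sequence[str],
                    score: Callable[[List[str]], float],
                    max_size: int) -> Tuple[float, List[str]]:
    """Sequential floating forward selection over a set of candidate models."""
    selected: List[str] = []
    best: Dict[int, Tuple[float, List[str]]] = {}  # best (score, subset) per size

    def record(subset: List[str]) -> bool:
        """Store subset if it strictly beats the best one of the same size."""
        s = score(subset)
        if len(subset) not in best or s > best[len(subset)][0]:
            best[len(subset)] = (s, list(subset))
            return True
        return False

    while len(selected) < max_size:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Forward step: add the candidate that gives the highest score.
        add = max(remaining, key=lambda c: score(selected + [c]))
        selected = selected + [add]
        record(selected)
        # Floating step: keep dropping members while that strictly improves
        # on the best subset of the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda c: score([x for x in selected if x != c]))
            reduced = [x for x in selected if x != drop]
            if record(reduced):
                selected = reduced
            else:
                break
    return max(best.values(), key=lambda t: t[0])


# Hypothetical toy score: each model has a gain, and redundant pairs are penalised.
GAIN = {"A": 50, "B": 45, "C": 40, "D": 10}
REDUNDANT = {frozenset("AB"): 40, frozenset("AC"): 40}


def toy_score(subset: List[str]) -> int:
    total = sum(GAIN[m] for m in subset)
    for i, a in enumerate(subset):
        for b in subset[i + 1:]:
            total -= REDUNDANT.get(frozenset((a, b)), 0)
    return total


best_score, best_subset = floating_search(list(GAIN), toy_score, max_size=4)
print(best_score, sorted(best_subset))
```

On this toy score, plain greedy forward selection commits to "A" first and never recovers, peaking at 65, whereas the backward steps let the search discard "A" and reach the subset B, C, D with a score of 95. The number of score evaluations stays far below exhaustive enumeration, which matters when each evaluation means training and validating a classifier combination.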
9
Keith JA, Vassilev-Galindo V, Cheng B, Chmiela S, Gastegger M, Müller KR, Tkatchenko A. Combining Machine Learning and Computational Chemistry for Predictive Insights Into Chemical Systems. Chem Rev 2021; 121:9816-9872. [PMID: 34232033 PMCID: PMC8391798 DOI: 10.1021/acs.chemrev.1c00107] [Citation(s) in RCA: 186] [Impact Index Per Article: 62.0] [Received: 02/04/2021] [Indexed: 12/23/2022]
Abstract
Machine learning models are poised to make a transformative impact on chemical sciences by dramatically accelerating computational algorithms and amplifying insights available from computational chemistry methods. However, achieving this requires a confluence and coaction of expertise in computer science and physical sciences. This Review is written for new and experienced researchers working at the intersection of both fields. We first provide concise tutorials of computational chemistry and machine learning methods, showing how insights involving both can be achieved. We follow with a critical review of noteworthy applications that demonstrate how computational chemistry and machine learning can be used together to provide insightful (and useful) predictions in molecular and materials modeling, retrosyntheses, catalysis, and drug design.
Affiliation(s)
- John A. Keith
  - Department of Chemical and Petroleum Engineering, Swanson School of Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Valentin Vassilev-Galindo
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Bingqing Cheng
  - Accelerate Programme for Scientific Discovery, Department of Computer Science and Technology, 15 J. J. Thomson Avenue, Cambridge CB3 0FD, United Kingdom
- Stefan Chmiela
  - Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587, Berlin, Germany
- Michael Gastegger
  - Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587, Berlin, Germany
- Klaus-Robert Müller
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
  - Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul, 02841, Korea
  - Max-Planck-Institut für Informatik, 66123 Saarbrücken, Germany
  - Google Research, Brain Team, 10117 Berlin, Germany
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
10
ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations. Comput Biol Chem 2021; 93:107537. [PMID: 34217007 DOI: 10.1016/j.compbiolchem.2021.107537] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/14/2020] [Revised: 05/09/2021] [Accepted: 06/26/2021] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Primary and secondary active transport are two types of active transport that use energy to move substances. Active transport mechanisms use proteins to assist in transport and play essential roles in regulating the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT is a powerful model in transfer learning, a deep learning language representation model developed by Google and one of the highest performing pre-trained models for Natural Language Processing (NLP) tasks. The idea of transfer learning with a pre-trained BERT model is applied to extract fixed feature vectors from the hidden layers and learn contextual relations between amino acids in the protein sequence. Therefore, the contextualized word representations of proteins are introduced to effectively model complex structures of amino acids in the sequence and the variations of these amino acids in the context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence. RESULTS: The performance of the proposed method is evaluated using five-fold cross-validation and an independent test. The proposed method achieves an accuracy of 85.44%, 88.74% and 92.84% for Class-1, Class-2, and Class-3, respectively. Experimental results show that this approach can outperform other feature extraction methods using context information, effectively classify two types of active transport and improve the overall performance.
11
Karollus A, Avsec Ž, Gagneur J. Predicting mean ribosome load for 5'UTR of any length using deep learning. PLoS Comput Biol 2021; 17:e1008982. [PMID: 33970899 PMCID: PMC8136849 DOI: 10.1371/journal.pcbi.1008982] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/14/2020] [Revised: 05/20/2021] [Accepted: 04/19/2021] [Indexed: 01/07/2023] Open
Abstract
The 5’ untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5’UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL)—a proxy for translation rate—directly from 5’UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5’UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5’UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants. The human genome carries a complex code. It consists of genes, which provide blueprints to assemble proteins, and regulatory elements, which control when, where, and how often particular genes are transcribed and translated into protein. To read the genome correctly and specifically to find the causes of inherited diseases, we need to be able to find and interpret these regulatory elements. Here, we focus on particular regions of the genome, the so-called 5’ untranslated regions, which play an important role in determining how often a transcribed gene is translated into protein. 
We develop deep learning models which can quantitatively interpret regulatory elements in human 5’ untranslated regions and use this information to predict a proxy of the translation efficiency. Our model generalizes a previous model to 5’ untranslated regions of any length, just as they are encountered in natural human genes. Because this model requires only the sequence as input, it can give estimates for the impact of mutations in the sequence, even if these particular mutations are very rare or entirely novel. Such estimates could help pinpoint mutations that disrupt the normal functioning of gene regulation, which could be used to better diagnose patients suffering from rare genetic disorders.
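The abstract names frame pooling but does not define it. One plausible reading, sketched below as an assumption rather than the authors' definition, is that per-position feature vectors are grouped by reading frame (position modulo 3, counted from the 3' end next to the start codon) and each frame is summarized by its maximum and mean. The output size then depends only on the number of channels, not on sequence length. The function name and input layout are hypothetical.

```python
from typing import List, Sequence


def frame_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Pool per-position features separately for each of the three frames.

    features[i] is the feature vector (one value per channel) at position i;
    the last position is assumed to sit next to the start codon.
    Returns, for each frame and channel, the max followed by the mean.
    """
    if not features:
        raise ValueError("need at least one position")
    length = len(features)
    n_channels = len(features[0])
    pooled: List[float] = []
    for frame in range(3):
        # Positions whose distance to the 3' end is congruent to `frame` mod 3.
        rows = [features[i] for i in range(length)
                if (length - 1 - i) % 3 == frame]
        for c in range(n_channels):
            column = [row[c] for row in rows]
            if column:
                pooled.append(max(column))
                pooled.append(sum(column) / len(column))
            else:
                # Frame absent in sequences shorter than three positions.
                pooled.extend([0.0, 0.0])
    return pooled


# Sequences of different lengths map to vectors of the same size.
short = frame_pool([[1.0], [2.0], [3.0], [4.0]])
long_ = frame_pool([[float(i)] for i in range(10)])
print(short)
print(len(short), len(long_))
```

Pooling per frame rather than over the whole sequence keeps apart signals that are in frame with the start codon and those that are not, while still producing a fixed-size vector that a downstream dense layer can consume for inputs of any length.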
Affiliation(s)
- Alexander Karollus
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, Technical University of Munich, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
12
Wei C, Zhang J, Yuan X, He Z, Liu G, Wu J. NeuroTIS: Enhancing the prediction of translation initiation sites in mRNA sequences via a hybrid dependency network and deep learning framework. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106459] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/20/2022]
13
Goel N, Singh S, Aseri TC. Global sequence features based translation initiation site prediction in human genomic sequences. Heliyon 2020; 6:e04825. [PMID: 32964155 PMCID: PMC7490824 DOI: 10.1016/j.heliyon.2020.e04825] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/20/2019] [Revised: 05/25/2020] [Accepted: 08/26/2020] [Indexed: 11/26/2022] Open
Abstract
Gene prediction has become increasingly important in genome annotation due to advances in sequencing technology. Genome annotation further helps in determining the structure and function of these genes. Translation initiation site (TIS) prediction in human genomic sequences is one of the fundamental and essential steps in gene prediction. Thus, accurate prediction of TIS in these sequences is highly desirable. Although many computational methods have been developed for this problem, none of them focused on finding these sites in human genomic sequences. In this paper, a new TIS prediction method is proposed that incorporates global sequence-based features. A support vector machine is used to assess the predictive power of these features. The proposed method achieved an accuracy above 90% when tested on genomic as well as cDNA sequences. The experimental results indicate that the method works well for both genomic and cDNA sequences. The method can be integrated into a gene prediction system in the future.
Affiliation(s)
- Neelam Goel
  - Department of Information Technology, University Institute of Engineering and Technology, Sector-25, Panjab University, Chandigarh 160014, India
- Shailendra Singh
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector-12, Chandigarh 160012, India
- Trilok Chand Aseri
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector-12, Chandigarh 160012, India
|
14
|
Kao HJ, Nguyen VN, Huang KY, Chang WC, Lee TY. SuccSite: Incorporating Amino Acid Composition and Informative k-spaced Amino Acid Pairs to Identify Protein Succinylation Sites. GENOMICS PROTEOMICS & BIOINFORMATICS 2020; 18:208-219. [PMID: 32592791 PMCID: PMC7647693 DOI: 10.1016/j.gpb.2018.10.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 03/08/2018] [Revised: 10/01/2018] [Accepted: 10/11/2018] [Indexed: 12/14/2022]
Abstract
Protein succinylation is a biochemical reaction in which a succinyl group (-CO-CH2-CH2-CO-) is attached to the lysine residue of a protein molecule. Lysine succinylation plays important regulatory roles in living cells. However, studies in this field are limited by the difficulty in experimentally identifying the substrate site specificity of lysine succinylation. To facilitate this process, several tools have been proposed for the computational identification of succinylated lysine sites. In this study, we developed an approach to investigate the substrate specificity of lysine succinylated sites based on amino acid composition. Using experimentally verified lysine succinylated sites collected from public resources, the significant differences in position-specific amino acid composition between succinylated and non-succinylated sites were represented using the Two Sample Logo program. These findings enabled the adoption of an effective machine learning method, support vector machine, to train a predictive model with not only the amino acid composition, but also the composition of k-spaced amino acid pairs. After the selection of the best model using a ten-fold cross-validation approach, the selected model significantly outperformed existing tools based on an independent dataset manually extracted from published research articles. Finally, the selected model was used to develop a web-based tool, SuccSite, to aid the study of protein succinylation. Two proteins were used as case studies on the website to demonstrate the effective prediction of succinylation sites. We will regularly update SuccSite by integrating more experimental datasets. SuccSite is freely accessible at http://csb.cse.yzu.edu.tw/SuccSite/.
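The composition of k-spaced amino acid pairs mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition (for each gap k, count residue pairs separated by k positions and divide by the number of such positions); the function name `cksaap`, the choice of `k_max`, and the example window are hypothetical and are not taken from SuccSite itself.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(window, k_max=2):
    """Return {(k, pair): frequency} for gaps k = 0..k_max in one sequence window."""
    features = {}
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        # Number of positions where a pair with k residues in between fits.
        n_positions = max(len(window) - k - 1, 0)
        for i in range(n_positions):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # skip padding or non-standard symbols
                counts[pair] += 1
        for pair, count in counts.items():
            features[(k, pair)] = count / n_positions if n_positions else 0.0
    return features


# Hypothetical 7-residue window centred on a lysine (K); not data from the paper.
example = cksaap("AEKKLKA", k_max=1)
print(example[(0, "KK")], example[(1, "KL")])
```

Each gap k contributes 400 features, so the vector grows linearly with `k_max`; a classifier such as the support vector machine named in the abstract would be trained on these vectors.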
Affiliation(s)
- Hui-Ju Kao
  - Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 32003, Taiwan, China
- Van-Nui Nguyen
  - Department of Information Technology, University of Information and Communication Technology, Thai Nguyen 1000, Vietnam
- Kai-Yao Huang
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China; Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China
- Wen-Chi Chang
  - Institute of Tropical Plant Sciences, Cheng Kung University, Tainan 701, Taiwan, China
- Tzong-Yi Lee
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China; Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China
|
15
|
Yin T, König S. Genomic predictions of growth curves in Holstein dairy cattle based on parameter estimates from nonlinear models combined with different kernel functions. J Dairy Sci 2020; 103:7222-7237. [PMID: 32534925 DOI: 10.3168/jds.2019-18010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/04/2019] [Accepted: 04/06/2020] [Indexed: 11/19/2022]
Abstract
Availability of longitudinal body weight (BW) records allows the application of nonlinear models (NLINM) to predict phenotypic and genomic growth curves in dairy cattle. In this regard, we considered a data set including 31,722 BW records from 4,952 female Holstein cattle, during the period from birth (mo 0) to approximately age at first calving (mo 24). Parameters of the growth curves were estimated using 3 NLINM: the logistic (LOG), the Gompertz (GOM), and the Richards (RICH) functions. Residuals for the growth curve parameters from the NLINM applications were used as pseudo-phenotypes in the ongoing genomic analyses with different similarity matrices, including 2 genomic relationship matrices (G1 and G2), a combined pedigree and genomic relationship matrix (H), and 3 kernel matrices. The kernels were a weighted "alike by state" kernel function (K1), an exponential dissimilarity kernel (K2), and a Gaussian kernel (K3). On the basis of G1 and G2 matrices, genomic heritabilities for the growth curve parameters birth weight (W0), mature weight (Wm), and growth rate (k), and the shape parameter (m; only available from RICH) were moderate to large, in the range from 0.29 (m from RICH) to 0.46 (k from RICH). Fitting the similarity matrices based on kernel functions contributed to an increase of the ratio of the variance explained by the similarity matrix in relation to the total variance (compared with the heritability when modeling G1 or G2). Genetic correlations between W0, Wm, and k were always positive (>0.30), especially for the same growth curve parameters estimated from different NLINM (>0.90). The shape parameter m from RICH was negatively correlated with other growth curve parameters, from -0.29 to -0.95. In a next step, estimated genomic breeding values for growth curve parameters were input data for the respective NLINM, aiming to construct genomic growth curves. 
Prediction accuracies were correlations between genomic growth curves and genomic breeding values from random regression models for sires and female cattle. Considering all genotyped female cattle with pseudo-phenotypes, prediction accuracies were larger from RICH than from LOG and GOM. However, differences in prediction accuracies from the NLINM × similarity matrix combinations were quite small. Accordingly, in 5-fold cross-validations using heifer groups with masked phenotypes, very similar prediction accuracies across modeling approaches were identified. Especially for specific age months, genomic growth curve predictions were more accurate for sires than for female cattle, indicating that the relationships between animals in training and validation sets are more important than the selection of specific NLINM × similarity matrix combinations.
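To make the growth curve parameters W0, Wm, and k concrete, the sketch below evaluates a logistic and a Gompertz curve. The parameterizations are common textbook forms chosen so that the curve starts at W0 and approaches Wm; they are an assumption, since the abstract does not state the exact equations, and the Richards function is omitted because its form depends on how the shape parameter m is defined. The numeric values are hypothetical.

```python
import math


def logistic_weight(t, w0, wm, k):
    """Logistic growth: W(0) = w0 and W(t) -> wm as t grows."""
    return wm / (1.0 + ((wm - w0) / w0) * math.exp(-k * t))


def gompertz_weight(t, w0, wm, k):
    """Gompertz growth: W(0) = w0 and W(t) -> wm as t grows."""
    return wm * math.exp(math.log(w0 / wm) * math.exp(-k * t))


# Hypothetical values: 40 kg at birth, 650 kg mature weight, rate 0.15 per month.
W0, WM, K = 40.0, 650.0, 0.15
months = range(0, 25, 6)  # birth to roughly first calving (mo 0 to mo 24)
logistic_curve = [logistic_weight(t, W0, WM, K) for t in months]
gompertz_curve = [gompertz_weight(t, W0, WM, K) for t in months]
for t, lw, gw in zip(months, logistic_curve, gompertz_curve):
    print(f"month {t:2d}: logistic {lw:6.1f} kg, Gompertz {gw:6.1f} kg")
```

In the study's setting, the three parameters would be estimated per animal and then treated as traits in the genomic analyses; here they are simply fixed inputs.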
Affiliation(s)
- T Yin
  - Institute of Animal Breeding and Genetics, Justus-Liebig-University Gießen, 35390 Gießen, Germany
- S König
  - Institute of Animal Breeding and Genetics, Justus-Liebig-University Gießen, 35390 Gießen, Germany
|
16
|
milRNApredictor: Genome-free prediction of fungi milRNAs by incorporating k-mer scheme and distance-dependent pair potential. Genomics 2019; 112:2233-2240. [PMID: 31884158 DOI: 10.1016/j.ygeno.2019.12.019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/09/2019] [Revised: 12/05/2019] [Accepted: 12/25/2019] [Indexed: 11/22/2022]
Abstract
MicroRNA-like small RNAs (milRNAs), 21-22 nucleotides in length, are a type of small non-coding RNA first found in Neurospora crassa in 2010. Identifying milRNAs of species without genomic information is a difficult problem. Here, knowledge-based energy features are developed to identify milRNAs by incorporating the k-mer scheme and a distance-dependent pair potential. Compared with the k-mer scheme, the features developed here can alleviate the inherent curse of dimensionality in the k-mer scheme once k becomes large. In addition, milRNApredictor, built on the novel features, performs comparably to the k-mer scheme and achieves a sensitivity of 74.21% and a specificity of 75.72% based on 10-fold cross-validation. Furthermore, for novel miRNA prediction, there is high overlap between the results from milRNApredictor and the state-of-the-art mirnovo. However, milRNApredictor is simpler to use, with reduced requirements for input data and dependencies. Taken together, milRNApredictor can be used to identify de novo fungal milRNAs and other very short small RNAs of non-model organisms.
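The dimensionality issue of the k-mer scheme noted above can be seen in a small sketch: the feature vector has 4^k entries, while a 21-22 nucleotide read offers only about 20 windows to fill them. The function name `kmer_frequencies` and the example read are hypothetical; this shows the baseline k-mer scheme only, not the energy features proposed in the paper.

```python
from itertools import product

RNA_ALPHABET = "ACGU"


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer in an RNA sequence."""
    counts = {"".join(p): 0 for p in product(RNA_ALPHABET, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore windows containing ambiguous bases
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


read = "UGGAAUGUAAAGAAGUAUGUAU"  # hypothetical 22-nt small RNA, for illustration only
for k in (1, 2, 3, 4, 5):
    features = kmer_frequencies(read, k)
    nonzero = sum(1 for v in features.values() if v > 0)
    print(f"k={k}: {len(features)} features, {nonzero} non-zero")
```

For larger k almost every entry is zero, which is the sparsity problem that lower-dimensional features are meant to avoid.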
|
17
|
Sun S, Wang C, Ding H, Zou Q. Machine learning and its applications in plant molecular studies. Brief Funct Genomics 2019; 19:40-48. [DOI: 10.1093/bfgp/elz036] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 06/12/2019] [Revised: 09/06/2019] [Accepted: 09/15/2019] [Indexed: 01/16/2023] Open
Abstract
The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies.
Affiliation(s)
- Shanwen Sun
  - University of Bayreuth in Germany. He is now a postdoctoral fellow at the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
- Chunyu Wang
  - Harbin Institute of Technology in China. He is an associate professor in the School of Computer Science and Technology, Harbin Institute of Technology
- Hui Ding
  - Inner Mongolia University in China. She is an associate professor in the Center for Informational Biology, University of Electronic Science and Technology of China
- Quan Zou
  - Harbin Institute of Technology in China. He is a professor in the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
|
18
|
Huang KY, Hsu JBK, Lee TY. Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method. Sci Rep 2019; 9:16175. [PMID: 31700141 PMCID: PMC6838336 DOI: 10.1038/s41598-019-52552-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/09/2019] [Accepted: 10/18/2019] [Indexed: 12/14/2022] Open
Abstract
Succinylation is a type of protein post-translational modification (PTM), which can play important roles in a variety of cellular processes. Due to an increasing number of site-specific succinylated peptides obtained from high-throughput mass spectrometry (MS), various tools have been developed for computationally identifying succinylated sites on proteins. However, most of these tools predict succinylation sites based on traditional machine learning methods. Hence, this work aimed to carry out the succinylation site prediction based on a deep learning model. The abundance of MS-verified succinylated peptides enabled the investigation of substrate site specificity of succinylation sites through sequence-based attributes, such as position-specific amino acid composition, the composition of k-spaced amino acid pairs (CKSAAP), and position-specific scoring matrix (PSSM). Additionally, the maximal dependence decomposition (MDD) was adopted to detect the substrate signatures of lysine succinylation sites by dividing all succinylated sequences into several groups with conserved substrate motifs. According to the results of ten-fold cross-validation, the deep learning model trained using PSSM and informative CKSAAP attributes can reach the best predictive performance and also perform better than traditional machine-learning methods. Moreover, an independent testing dataset that truly did not exist in the training dataset was used to compare the proposed method with six existing prediction tools. The testing dataset comprised of 218 positive and 2621 negative instances, and the proposed model could yield a promising performance with 84.40% sensitivity, 86.99% specificity, 86.79% accuracy, and an MCC value of 0.489. Finally, the proposed method has been implemented as a web-based prediction tool (CNN-SuccSite), which is now freely accessible at http://csb.cse.yzu.edu.tw/CNN-SuccSite/.
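The reported test-set figures can be cross-checked from the class sizes given in the abstract (218 positive, 2621 negative). The sketch below recovers an approximate confusion matrix by rounding the reported sensitivity and specificity to whole counts, which is an assumption because the abstract does not list the counts, and then recomputes accuracy and the Matthews correlation coefficient (MCC) from its standard definition.

```python
import math


def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Approximate (TP, FN, TN, FP) by rounding rate * class size to whole counts."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Class sizes and rates as reported in the abstract.
tp, fn, tn, fp = confusion_from_rates(218, 2621, 0.8440, 0.8699)
print(tp, fn, tn, fp)
print(f"accuracy = {accuracy(tp, fn, tn, fp):.4f}, MCC = {mcc(tp, fn, tn, fp):.3f}")
```

The recomputed accuracy agrees with the reported 86.79%, and the MCC comes out close to the reported 0.489. The moderate MCC despite high sensitivity and specificity reflects the class imbalance: with roughly twelve negatives per positive, false positives outnumber true positives.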
Affiliation(s)
- Kai-Yao Huang
  - Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu city, 300, Taiwan
- Justin Bo-Kai Hsu
  - Department of Medical Research, Taipei Medical University Hospital, Taipei city, 110, Taiwan
- Tzong-Yi Lee
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 518172, China
  - School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, 518172, China
|
19
|
Xu H, He L, Zhong B, Qiu J, Tu J. Classification and prediction of inertial cavitation activity induced by pulsed high-intensity focused ultrasound. ULTRASONICS SONOCHEMISTRY 2019; 56:77-83. [PMID: 31101291 DOI: 10.1016/j.ultsonch.2019.03.031] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/19/2018] [Revised: 02/02/2019] [Accepted: 03/31/2019] [Indexed: 06/09/2023]
Abstract
Classification and prediction of ultrasound-induced microbubble inertial cavitation (IC) activity may play an important role in designing ultrasound treatment strategies with improved efficiency and safety. Here, a new method is proposed that combines the support vector machine (SVM) algorithm with passive cavitation detection (PCD) measurements to perform IC event classification and IC dose prediction. Using the PCD system, IC thresholds and IC doses were first measured for various ultrasound contrast agent (UCA) solutions exposed to pulsed high-intensity focused ultrasound (pHIFU) at different driving pressures and pulse lengths. Then, after being trained and tested on the measured data, two SVM models (viz. C-SVC and ε-SVR) were established to classify the likelihood of IC event occurrence and predict IC dose, respectively, under different parameter conditions. The findings of this study indicate that the combination of SVM and PCD could be a useful tool for optimizing the operation strategy of cavitation-facilitated pHIFU therapy.
Affiliation(s)
- Huan Xu
  - National Institute of Metrology, Beijing 100029, China
- Longbiao He
  - National Institute of Metrology, Beijing 100029, China
- Bo Zhong
  - National Institute of Metrology, Beijing 100029, China
- Jianmin Qiu
  - Zhejiang Institute of Metrology, Hangzhou 310018, China
- Juan Tu
  - Key Laboratory of Modern Acoustics (MOE), Department of Physics, Collaborative Innovation Center of Advanced Microstructure, Nanjing University, Nanjing 210093, China
|
20
|
Wahba MA, Ashour AS, Guo Y, Napoleon SA, Elnaby MMA. A novel cumulative level difference mean based GLDM and modified ABCD features ranked using eigenvector centrality approach for four skin lesion types classification. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 165:163-174. [PMID: 30337071 DOI: 10.1016/j.cmpb.2018.08.009] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/18/2018] [Revised: 07/20/2018] [Accepted: 08/08/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE Melanoma is one of the major causes of death, while basal cell carcinoma (BCC) is the skin lesion type with the highest incidence. At their early stages, medical experts may confuse both types with benign nevus and pigmented benign keratoses (BKL). This inspired the current study to develop an accurate, automated, user-friendly skin lesion identification system. METHODS The current work targets a novel technique for discriminating the four aforementioned skin lesion classes. A novel texture feature, named cumulative level-difference mean (CLDM), based on the gray-level difference method (GLDM), is extracted. The asymmetry, border irregularity, color variation and diameter are summed up as the ABCD rule feature vector, which was originally used to distinguish melanoma from benign lesions. The proposed method improves the ABCD rule to also classify BCC and BKL by using the proposed modified-ABCD feature vector. In the modified set of ABCD features, each border feature, such as compact index, fractal dimension, and edge abruptness, is considered a separate feature. Then, the composite feature vector containing the aforementioned features is ranked using the Eigenvector Centrality (ECFS) feature ranking method. The ranked features are then classified by a cubic support vector machine for different numbers of selected features. RESULTS The proposed CLDM texture features combined with the ranked ABCD features achieved outstanding performance in classifying the four targeted classes (melanoma, BCC, nevi and BKL). The results report 100% sensitivity, accuracy and specificity for each class, compared with other features, when using the seven highest-ranked features. CONCLUSIONS The proposed system established that melanoma, BCC, nevus and BKL are efficiently classified using a cubic SVM with the new feature set. In addition, the comparative studies showed the superiority of the cubic SVM in classifying the four classes.
MESH Headings
- Algorithms
- Carcinoma, Basal Cell/classification
- Carcinoma, Basal Cell/diagnostic imaging
- Carcinoma, Basal Cell/pathology
- Carcinoma, Squamous Cell/classification
- Carcinoma, Squamous Cell/diagnostic imaging
- Carcinoma, Squamous Cell/pathology
- Databases, Factual
- Dermoscopy/methods
- Diagnosis, Computer-Assisted/methods
- Diagnosis, Computer-Assisted/statistics & numerical data
- Diagnosis, Differential
- Fractals
- Humans
- Image Interpretation, Computer-Assisted/methods
- Image Interpretation, Computer-Assisted/statistics & numerical data
- Keratosis/classification
- Keratosis/diagnostic imaging
- Keratosis/pathology
- Melanoma/classification
- Melanoma/diagnostic imaging
- Melanoma/pathology
- Nevus, Pigmented/classification
- Nevus, Pigmented/diagnostic imaging
- Nevus, Pigmented/pathology
- Pattern Recognition, Automated/methods
- Pattern Recognition, Automated/statistics & numerical data
- Skin/diagnostic imaging
- Skin/pathology
- Skin Diseases/classification
- Skin Diseases/diagnostic imaging
- Skin Diseases/pathology
- Skin Neoplasms/classification
- Skin Neoplasms/diagnostic imaging
- Skin Neoplasms/pathology
- Support Vector Machine
Affiliation(s)
- Maram A Wahba
  - Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Amira S Ashour
  - Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Yanhui Guo
  - Department of Computer Science, University of Illinois at Springfield, Springfield, IL, USA
- Sameh A Napoleon
  - Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Mustafa M Abd Elnaby
  - Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
|
21
|
Jamalabadi H, Alizadeh S, Schönauer M, Leibold C, Gais S. Multivariate classification of neuroimaging data with nested subclasses: Biased accuracy and implications for hypothesis testing. PLoS Comput Biol 2018; 14:e1006486. [PMID: 30260958 PMCID: PMC6177201 DOI: 10.1371/journal.pcbi.1006486] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/11/2017] [Revised: 10/09/2018] [Accepted: 09/03/2018] [Indexed: 11/29/2022] Open
Abstract
Biological data sets are typically characterized by high dimensionality and low effect sizes. A powerful method for detecting systematic differences between experimental conditions in such multivariate data sets is multivariate pattern analysis (MVPA), particularly pattern classification. However, in virtually all applications, data from the classes that correspond to the conditions of interest are not homogeneous but contain subclasses. Such subclasses can for example arise from individual subjects that contribute multiple data points, or from correlations of items within classes. We show here that in multivariate data that have subclasses nested within its class structure, these subclasses introduce systematic information that improves classifiability beyond what is expected by the size of the class difference. We analytically prove that this subclass bias systematically inflates correct classification rates (CCRs) of linear classifiers depending on the number of subclasses as well as on the portion of variance induced by the subclasses. In simulations, we demonstrate that subclass bias is highest when between-class effect size is low and subclass variance high. This bias can be reduced by increasing the total number of subclasses. However, we can account for the subclass bias by using permutation tests that explicitly consider the subclass structure of the data. We illustrate our result in several experiments that recorded human EEG activity, demonstrating that parametric statistical tests as well as typical trial-wise permutation fail to determine significance of classification outcomes correctly. When data are analyzed using multivariate pattern classification, any systematic similarities between subsets of trials (e.g. shared physical properties among a subgroup of stimuli, trials belonging to the same session or subject, etc.) form distinct nested subclasses within each class. 
Pattern classification is sensitive to this kind of structure in the data and uses such groupings to increase classification accuracies even when data from both conditions are sampled from the same distribution, i.e. the null hypothesis is true. Here, we show that the bias is higher for larger subclass variances and that it is directly related to the number of subclasses and the intraclass correlation (ICC). Because the increased classification accuracy in such data sets is not based on class differences, the null distribution should be adjusted to account for this type of bias. To do so, we propose to use blocked permutation testing on subclass levels and show that it can confine the false positive rate to the predefined α-levels.
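The blocked permutation idea described above can be sketched as follows: instead of shuffling labels trial by trial, class labels are shuffled across whole subclasses (for example subjects or sessions), so the nested structure is preserved under the null. This is a minimal illustration under stated assumptions, not the authors' implementation; the function names are hypothetical, and `mean_difference` is a toy stand-in for a cross-validated correct classification rate.

```python
import random


def blocked_permutation(labels, blocks, rng):
    """Shuffle class labels across subclasses, keeping every subclass intact."""
    block_label = {}
    for label, block in zip(labels, blocks):
        if block_label.setdefault(block, label) != label:
            raise ValueError("each subclass must belong to exactly one class")
    block_ids = list(block_label)
    shuffled = [block_label[b] for b in block_ids]
    rng.shuffle(shuffled)
    relabel = dict(zip(block_ids, shuffled))
    return [relabel[b] for b in blocks]


def blocked_permutation_test(score_fn, data, labels, blocks, n_perm=999, seed=0):
    """Return the observed score and its p-value under subclass-level permutation."""
    rng = random.Random(seed)
    observed = score_fn(data, labels)
    null = [score_fn(data, blocked_permutation(labels, blocks, rng))
            for _ in range(n_perm)]
    p_value = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, p_value


def mean_difference(data, labels):
    """Toy score: absolute difference between the two class means."""
    a = [x for x, y in zip(data, labels) if y == 0]
    b = [x for x, y in zip(data, labels) if y == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Hypothetical data: four subjects (subclasses), three trials each, two per class.
blocks = ["s1"] * 3 + ["s2"] * 3 + ["s3"] * 3 + ["s4"] * 3
labels = [0] * 6 + [1] * 6
data = [0.9, 1.1, 1.0, 0.2, 0.1, 0.3, 0.8, 1.0, 0.9, 0.3, 0.2, 0.1]
observed, p_value = blocked_permutation_test(mean_difference, data, labels, blocks)
print(observed, p_value)
```

With only four subclasses there are few distinct relabelings, so the smallest attainable p-value is large; this mirrors the abstract's point that the number of subclasses, not the number of trials, determines how much evidence the data carry.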
Affiliation(s)
- Hamidreza Jamalabadi
  - Medical Psychology and Behavioral Neurobiology, University of Tübingen, Tübingen, Germany
  - Bernstein Center for Computational Neuroscience, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
  - IMPRS for Cognitive and Systems Neuroscience, University of Tübingen, Tübingen, Germany
  - Department of Psychiatry, Division for Translational Psychiatry, University of Tübingen, Tübingen, Germany
- Sarah Alizadeh
  - Medical Psychology and Behavioral Neurobiology, University of Tübingen, Tübingen, Germany
  - Bernstein Center for Computational Neuroscience, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
  - IMPRS for Cognitive and Systems Neuroscience, University of Tübingen, Tübingen, Germany
  - Department of Psychiatry, Division for Translational Psychiatry, University of Tübingen, Tübingen, Germany
- Monika Schönauer
  - Medical Psychology and Behavioral Neurobiology, University of Tübingen, Tübingen, Germany
  - Bernstein Center for Computational Neuroscience, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
  - Department of Psychology, Ludwig-Maximilians-Universität München, München, Germany
- Christian Leibold
  - Bernstein Center for Computational Neuroscience, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
  - Department of Biology II, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Steffen Gais
  - Medical Psychology and Behavioral Neurobiology, University of Tübingen, Tübingen, Germany
  - Bernstein Center for Computational Neuroscience, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
  - Department of Psychology, Ludwig-Maximilians-Universität München, München, Germany
|
22
|
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors and stop codons can be misread by tRNAs, which also contributes to codon usage in rare cases. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index of codon adaptation that accounts for background mutation bias (Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. Both are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production. The reanalysis disproves the claim.
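For readers unfamiliar with the codon adaptation index, the sketch below follows the widely used definition: each codon receives a relative adaptiveness w (its count in a reference set of highly expressed genes divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The reference counts and the two codon families are hypothetical toy values, and the Index of Translation Elongation is not shown because the abstract does not give its formula.

```python
import math


def relative_adaptiveness(reference_counts, families):
    """w for each codon: its reference count over the family maximum."""
    w = {}
    for codons in families.values():
        top = max(reference_counts[c] for c in codons)
        for c in codons:
            w[c] = reference_counts[c] / top
    return w


def cai(sequence, w):
    """Geometric mean of w over the codons of a coding sequence.

    Codons without a w value (for example Met, Trp, or stop codons) are skipped.
    Reference counts are assumed to be positive, so log(w) is always defined.
    """
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference usage for two synonymous families (hypothetical counts).
families = {"Lys": ["AAA", "AAG"], "Glu": ["GAA", "GAG"]}
reference_counts = {"AAA": 20, "AAG": 80, "GAA": 70, "GAG": 30}
w = relative_adaptiveness(reference_counts, families)
print(cai("AAGGAAAAA", w))  # codons AAG, GAA, AAA
```

Because w depends only on codon counts in the reference genes, a codon favored by mutation bias alone also receives a high w; this is the limitation that an index accounting for background mutation bias is meant to address.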
|
23
|
Abstract
Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulation regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANN), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNN) have been demonstrated to be the best tools for improving performance in problem solving tasks within the genomic field.
|
24
|
Abnormal neural activity as a potential biomarker for drug-naive first-episode adolescent-onset schizophrenia with coherence regional homogeneity and support vector machine analyses. Schizophr Res 2018; 192:408-415. [PMID: 28476336 DOI: 10.1016/j.schres.2017.04.028] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 07/19/2016] [Revised: 04/12/2017] [Accepted: 04/14/2017] [Indexed: 12/19/2022]
Abstract
BACKGROUND Patients with adolescent-onset schizophrenia (AOS) have the same symptoms as patients with adult-onset schizophrenia but in a more severe form, with worse outcomes and a poor treatment response to antipsychotics. Several dominant brain regions of schizophrenia patients show significantly abnormal structural and functional connectivity during resting-state scans. However, coherence regional homogeneity (Cohe-ReHo) in drug-naive first-episode patients with AOS remains unclear. METHOD A total of 48 drug-naive first-episode AOS outpatients and 31 healthy controls underwent resting-state functional magnetic resonance scans. Cohe-ReHo and support vector machine analyses were used to analyze the data. RESULTS Compared with the healthy controls, the AOS group showed significantly decreased Cohe-ReHo values distributed over brain regions, including the left postcentral gyrus, left superior temporal gyrus, left paracentral lobule, right precentral gyrus, right inferior parietal lobule (IPL), right middle frontal gyrus, and bilateral precuneus. No region with increased Cohe-ReHo values was observed in the AOS group compared with healthy controls. In addition, Cohe-ReHo in the right IPL was correlated with fluency (r=-0.324, p=0.030). However, the correlation was not significant after the Bonferroni correction at p<0.0083 (0.05/6). A combination of the Cohe-ReHo values in the bilateral precuneus and right IPL discriminated the patients from controls with a sensitivity, specificity, and accuracy of 91.67%, 87.10%, and 89.87%, respectively. CONCLUSION Our findings suggest that the AOS patients exhibited diminished Cohe-ReHo values in some regions within the default mode network (DMN) and sensorimotor network. The abnormalities in particular brain regions (bilateral precuneus and right IPL) may serve as potential biomarkers for AOS.
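The arithmetic behind two statements in this abstract can be checked directly: the Bonferroni threshold of 0.05/6, and the whole-number counts implied by the reported percentages given 48 patients and 31 controls. The counts are inferred here by search and are an assumption, since the abstract reports only percentages; the helper name `counts_matching` is hypothetical.

```python
def counts_matching(n, reported_pct, tol=0.005):
    """Whole-number counts out of n whose percentage rounds to the reported value."""
    return [c for c in range(n + 1) if abs(100.0 * c / n - reported_pct) < tol]


n_patients, n_controls = 48, 31
tp_options = counts_matching(n_patients, 91.67)   # correctly classified patients
tn_options = counts_matching(n_controls, 87.10)   # correctly classified controls
accuracy_pct = 100.0 * (tp_options[0] + tn_options[0]) / (n_patients + n_controls)
print(tp_options, tn_options, round(accuracy_pct, 2))

# Bonferroni correction: divide the family-wise alpha by the number of tests.
alpha, n_tests, p_value = 0.05, 6, 0.030
bonferroni_threshold = alpha / n_tests
significant = p_value < bonferroni_threshold
print(round(bonferroni_threshold, 4), significant)
```

Each reported percentage is matched by exactly one count, and the accuracy recomputed from those counts agrees with the reported 89.87%. The correlation's p-value of 0.030 exceeds the corrected threshold, consistent with the abstract's statement that it does not survive correction.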
|
25
|
Liu Y, Guo W, Zhang Y, Lv L, Hu F, Wu R, Zhao J. Decreased Resting-State Interhemispheric Functional Connectivity Correlated with Neurocognitive Deficits in Drug-Naive First-Episode Adolescent-Onset Schizophrenia. Int J Neuropsychopharmacol 2017; 21:33-41. [PMID: 29228204 PMCID: PMC5795351 DOI: 10.1093/ijnp/pyx095] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 05/25/2017] [Accepted: 10/19/2017] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Given that adolescence is a critical epoch in the onset of schizophrenia, studying aberrant brain changes in adolescent-onset schizophrenia, particularly in patients with drug-naive first-episode schizophrenia, is important to understand the biological mechanism of this disorder. Previous resting-state functional magnetic resonance imaging studies have shown abnormal functional connectivity in separate hemispheres in patients with adult-onset schizophrenia. Studying adolescent-onset schizophrenia may provide clues to the early aetiology of schizophrenia. METHOD A total of 48 drug-naive, first-episode, adolescent-onset schizophrenia outpatients and 31 healthy controls underwent resting-state functional magnetic resonance imaging scans. Data were subjected to voxel-mirrored homotopic connectivity and support vector machine analyses. RESULTS Compared with the healthy controls, the adolescent-onset schizophrenia group showed significantly lower voxel-mirrored homotopic connectivity values in several brain regions, including the fusiform gyrus, superior temporal gyrus/insula, precentral gyrus, and precuneus. Decreased voxel-mirrored homotopic connectivity values in the superior temporal gyrus/insula were significantly correlated with Trail-Making Test: Part A performance (r = -0.437, P = .002). A combination of the voxel-mirrored homotopic connectivity values in the precentral gyrus and precuneus may be used to discriminate patients with adolescent-onset schizophrenia from controls with satisfactory classification results: sensitivity of 100%, specificity of 87.09%, and accuracy of 94.93%. CONCLUSION Our findings highlight resting-state interhemispheric functional connectivity abnormalities within the sensorimotor network of patients with adolescent-onset schizophrenia and confirm the relationship between adolescent-onset schizophrenia and adult-onset schizophrenia. These findings suggest that reduced interhemispheric connectivity within the sensorimotor network has a pivotal role in the pathogenesis of schizophrenia.
Affiliation(s)
- Yi Liu
- Department of Psychiatry, the Second Xiangya Hospital, Central South University, Changsha, Hunan
- Mental Health Institute of the Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- National Clinical Research Center on Mental Disorders, Changsha, Hunan, China
- National Technology Institute on Mental Disorders, Changsha, Hunan, China
- Hunan Key Laboratory of Psychiatry and Mental Health, Changsha, Hunan, China
- Wenbin Guo
- Department of Psychiatry, the Second Xiangya Hospital, Central South University, Changsha, Hunan
- Mental Health Institute of the Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- National Clinical Research Center on Mental Disorders, Changsha, Hunan, China
- National Technology Institute on Mental Disorders, Changsha, Hunan, China
- Hunan Key Laboratory of Psychiatry and Mental Health, Changsha, Hunan, China
- Yan Zhang
- Henan Key Laboratory of Biological Psychiatry, Henan Mental Hospital, Second Affiliated Hospital of Xinxiang Medical University, Xinxiang, China
- Luxian Lv
- Henan Key Laboratory of Biological Psychiatry, Henan Mental Hospital, Second Affiliated Hospital of Xinxiang Medical University, Xinxiang, China
- Feihu Hu
- Department of Psychiatry, the Second Xiangya Hospital, Central South University, Changsha, Hunan
- Mental Health Institute of the Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- National Clinical Research Center on Mental Disorders, Changsha, Hunan, China
- National Technology Institute on Mental Disorders, Changsha, Hunan, China
- Hunan Key Laboratory of Psychiatry and Mental Health, Changsha, Hunan, China
- Renrong Wu
- Department of Psychiatry, the Second Xiangya Hospital, Central South University, Changsha, Hunan
- Mental Health Institute of the Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- National Clinical Research Center on Mental Disorders, Changsha, Hunan, China
- National Technology Institute on Mental Disorders, Changsha, Hunan, China
- Hunan Key Laboratory of Psychiatry and Mental Health, Changsha, Hunan, China
- Jingping Zhao
- Department of Psychiatry, the Second Xiangya Hospital, Central South University, Changsha, Hunan
- Henan Key Laboratory of Biological Psychiatry, Henan Mental Hospital, Second Affiliated Hospital of Xinxiang Medical University, Xinxiang, China
- Mental Health Institute of the Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- National Clinical Research Center on Mental Disorders, Changsha, Hunan, China
- National Technology Institute on Mental Disorders, Changsha, Hunan, China
- Hunan Key Laboratory of Psychiatry and Mental Health, Changsha, Hunan, China
- Correspondence: Jingping Zhao, MD, Department of Psychiatry, the Second Xiangya Hospital, Central South University, Changsha, Hunan 410011, China
|
26
|
Wahba MA, Ashour AS, Napoleon SA, Abd Elnaby MM, Guo Y. Combined empirical mode decomposition and texture features for skin lesion classification using quadratic support vector machine. Health Inf Sci Syst 2017; 5:10. [PMID: 29142740 DOI: 10.1007/s13755-017-0033-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 08/07/2017] [Accepted: 10/16/2017] [Indexed: 11/30/2022] Open
Abstract
Purpose Basal cell carcinoma is one of the most common malignant skin lesions. Automated lesion identification and classification using image processing techniques is much needed to reduce diagnostic errors. Methods In this study, a novel technique is applied to classify skin lesion images into two classes, namely malignant basal cell carcinoma and benign nevus. A hybrid combination of bi-dimensional empirical mode decomposition and gray-level difference method features is proposed after hair removal. The combined features are then classified using a quadratic support vector machine (Q-SVM). Results The proposed system achieved 100% accuracy, sensitivity, and specificity, outperforming other support vector machine procedures and other extracted feature sets. Conclusion Basal cell carcinoma is effectively classified using Q-SVM with the proposed combined features.
Affiliation(s)
- Maram A Wahba
- Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Amira S Ashour
- Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Sameh A Napoleon
- Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Mustafa M Abd Elnaby
- Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
- Yanhui Guo
- Department of Computer Science, University of Illinois at Springfield, Springfield, IL USA
|
27
|
Abstract
MOTIVATION Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), the translation process may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g. GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and for developing computational methods for TIS identification. METHODS We have developed a deep learning-based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework. RESULTS Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for the AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of upstream open reading frames on gene expression and the mutational effects influencing translation initiation efficiency. AVAILABILITY AND IMPLEMENTATION TITER is available as open-source software and can be downloaded from https://github.com/zhangsaithu/titer . CONTACT lzhang20@mail.tsinghua.edu.cn or zengjy321@tsinghua.edu.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
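To make "AUG and non-AUG codons" with their "surrounding sequence contexts" concrete, the sketch below enumerates candidate sites in an mRNA string. It is not TITER's pipeline: treating non-AUG candidates as codons that differ from AUG at exactly one position, the function names, and the flank size are all assumptions made for illustration.

```python
def near_cognate_codons(start="AUG", alphabet="ACGU"):
    """All codons that differ from `start` at exactly one position."""
    codons = set()
    for i in range(3):
        for base in alphabet:
            if base != start[i]:
                codons.add(start[:i] + base + start[i + 1:])
    return codons


def candidate_sites(seq, flank=5):
    """Yield (position, codon, context) for AUG and near-cognate codons.

    `flank` is a hypothetical window size. Sites too close to either end
    of the sequence to have a full window are skipped.
    """
    allowed = near_cognate_codons() | {"AUG"}
    for pos in range(flank, len(seq) - flank - 2):
        codon = seq[pos:pos + 3]
        if codon in allowed:
            yield pos, codon, seq[pos - flank:pos + 3 + flank]


for site in candidate_sites("GCCGCCACCAUGGCGUUU"):
    print(site)
```

Each extracted context window is the kind of fixed-length input a sequence classifier could score; how TITER encodes and scores such windows is described in the paper, not here.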
Affiliation(s)
- Sai Zhang
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Hailin Hu
- School of Medicine, Tsinghua University, Beijing, China
- Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA, USA
- MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Institute of Integrative Genome Biology, University of California, Riverside, CA, USA
- Lei Zhang
- School of Medicine, Tsinghua University, Beijing, China
- Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
|
28
|
Nunes Pinto CL, Nobre CN, Zárate LE. Transductive learning as an alternative to translation initiation site identification. BMC Bioinformatics 2017; 18:81. [PMID: 28152994 PMCID: PMC5290616 DOI: 10.1186/s12859-017-1502-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/25/2016] [Accepted: 01/28/2017] [Indexed: 11/23/2022] Open
Abstract
Background Correct identification of protein coding regions is an important and latent problem in molecular biology. The problem is challenging because of the lack of deep knowledge about the biological systems and unfamiliarity with conserved characteristics in messenger RNA (mRNA). It is therefore essential to investigate computational methods that aid pattern discovery for the identification of Translation Initiation Sites (TIS). In bioinformatics, machine learning methods based on inductive inference, such as the Inductive Support Vector Machine (ISVM), have been widely applied. In contrast, little attention has been given to machine learning methods based on transductive inference, such as the Transductive Support Vector Machine (TSVM). Transductive inference performs well for problems in which the number of unlabeled sequences is considerably greater than the number of labeled ones. TIS prediction may likewise benefit from transductive methods, because the number of new sequences grows rapidly as genome projects enable the study of new organisms. Consequently, this work investigates transductive learning for TIS identification and compares the results with those obtained with the inductive method. Results Transductive inference yields better F-measure and sensitivity than the inductive method for predicting the TIS. It also has the lowest failure rate in identifying the TIS, producing fewer False Negatives (FN) than the ISVM. The ISVM and TSVM methods were validated with molecules from the most representative organisms contained in the RefSeq database: Rattus norvegicus, Mus musculus, Homo sapiens, Drosophila melanogaster and Arabidopsis thaliana. The transductive method achieved F-measure and sensitivity higher than 90%, and higher than the results obtained with ISVM. The ISVM and TSVM approaches were implemented in the TransduTIS tool, as TransduTIS-I and TransduTIS-T respectively, available through a web interface. These approaches were compared with the TISHunter, TIS Miner, and NetStart tools, with satisfactory results. Conclusions In terms of precision, the results are similar for the ISVM and TSVM classifiers. However, the results show that the TSVM approach brought an improvement, especially in F-measure and sensitivity. Moreover, a potential application of TSVM was identified: organisms in the initial study phase with few identified sequences in the databases. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1502-6) contains supplementary material, which is available to authorized users.
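The relationship between false negatives, sensitivity, and F-measure that the Results rely on can be shown with a short calculation. The counts below are hypothetical and are not taken from the paper; they only illustrate that, at similar precision, fewer false negatives raise both sensitivity and F-measure.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, sensitivity, F-measure) from raw counts.

    Assumes at least one predicted positive and one true positive.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f1


# Hypothetical counts for two classifiers on the same 100 true sites.
inductive = precision_recall_f1(tp=85, fp=10, fn=15)
transductive = precision_recall_f1(tp=93, fp=10, fn=7)
print("inductive   ", [round(v, 3) for v in inductive])
print("transductive", [round(v, 3) for v in transductive])
```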
Affiliation(s)
- Cristiane Neri Nobre
- Pontifical Catholic University of Minas Gerais - PUC-MG, 255, Walter Ianni Street, Belo Horizonte, 31980-110, Brazil
- Luis Enrique Zárate
- Pontifical Catholic University of Minas Gerais - PUC-MG, 255, Walter Ianni Street, Belo Horizonte, 31980-110, Brazil
|
29
|
Abstract
Bioinformatic analysis can not only accelerate drug target identification and drug candidate screening and refinement, but also facilitate characterization of side effects and predict drug resistance. High-throughput data such as genomic, epigenetic, genome architecture, cistromic, transcriptomic, proteomic, and ribosome profiling data have all made significant contributions to mechanism-based drug discovery and drug repurposing. Accumulation of protein and RNA structures, as well as development of homology modeling and protein structure simulation, coupled with large structure databases of small molecules and metabolites, paved the way for more realistic protein-ligand docking experiments and more informative virtual screening. I present the conceptual framework that drives the collection of these high-throughput data, summarize the utility and potential of mining these data in drug discovery, outline a few inherent limitations in the data and in the software that mines them, point out new ways to refine analysis of these diverse types of data, and highlight commonly used software and databases relevant to drug discovery.
Affiliation(s)
- Xuhua Xia
- Department of Biology, Faculty of Science, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
- Ottawa Institute of Systems Biology, Ottawa K1H 8M5, Canada
|
30
|
Al Bataineh M, Al-qudah Z. A novel gene identification algorithm with Bayesian classification. Biomed Signal Process Control 2017. [DOI: 10.1016/j.bspc.2016.07.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022]
|
31
|
Lai CM, Yeh WC, Chang CY. Gene selection using information gain and improved simplified swarm optimization. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.089] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 10/21/2022]
|
32
|
Pérez-Rodríguez J, García-Pedrajas N. Stepwise approach for combining many sources of evidence for site-recognition in genomic sequences. BMC Bioinformatics 2016; 17:117. [PMID: 26945666 PMCID: PMC4779560 DOI: 10.1186/s12859-016-0968-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/19/2015] [Accepted: 02/22/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recognizing the different functional parts of genes, such as promoters, translation initiation sites, donors, acceptors and stop codons, is a fundamental task of many current studies in Bioinformatics. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. However, with the rapid evolution of our ability to collect genomic information, it has been shown that combining many sources of evidence is fundamental to the success of any recognition task. With the advent of next-generation sequencing, the number of available genomes is increasing very rapidly. Thus, methods for making use of such large amounts of information are needed. RESULTS In this paper, we present a methodology for combining tens or even hundreds of different classifiers for improved performance. Our approach can include an almost limitless number of sources of evidence. We can use the evidence for the prediction of sites in a certain species, such as human, or other species as needed. This approach can be used for any of the functional recognition tasks cited above. However, to provide the necessary focus, we have tested our approach in two functional recognition tasks: translation initiation site and stop codon recognition. We have used the entire human genome as a target and another 20 species as sources of evidence and tested our method on five different human chromosomes. The proposed method achieves better accuracy than the best state-of-the-art method in terms of both the geometric mean of the specificity and sensitivity and the areas under the receiver operating characteristic and precision-recall curves. Furthermore, our approach shows a more principled way for selecting the best genomes to be combined for a given recognition task. 
CONCLUSIONS Our approach has proven to be a powerful tool for improving the performance of functional site recognition, and it is a useful method for combining many sources of evidence for any recognition task in Bioinformatics. The results also show that the common approach of heuristically choosing the species to be used as source of evidence can be improved because the best combinations of genomes for recognition were those not usually selected. Although the experiments were performed for translation initiation site and stop codon recognition, any other recognition task may benefit from our methodology.
Affiliation(s)
- Javier Pérez-Rodríguez
- Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, 14071, Campus de Rabanales, Spain.
- Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, 14071, Campus de Rabanales, Spain.
|
33
|
Koyano H, Hayashida M, Akutsu T. Maximum margin classifier working in a set of strings. Proc Math Phys Eng Sci 2016; 472:20150551. [PMID: 27118908 PMCID: PMC4841474 DOI: 10.1098/rspa.2015.0551] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/10/2015] [Accepted: 02/02/2016] [Indexed: 11/12/2022] Open
Abstract
Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.
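The "non-one-to-one conversion" of strings into numerical vectors can be illustrated with the k-spectrum kernel, one common string kernel. The abstract does not name a specific kernel, so this choice and the function names are assumptions made for illustration. The sketch counts k-mers and shows two different strings that map to the same count vector.

```python
from collections import Counter


def spectrum(s, k):
    """Count every length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(cs[m] * ct[m] for m in cs)  # missing k-mers count as 0


a, b = "AACA", "ACAA"
print(spectrum(a, 2) == spectrum(b, 2))  # distinct strings, same vector
print(spectrum_kernel(a, b, 2))
```

Because "AACA" and "ACAA" contain the same 2-mers (AA, AC, CA), a classifier working on these vectors cannot tell them apart, which is the loss of information the abstract refers to.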
Affiliation(s)
- Hitoshi Koyano
- Laboratory of Biostatistics and Bioinformatics, Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan
- Morihiro Hayashida
- Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
- Tatsuya Akutsu
- Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
|
34
|
Herndon N, Caragea D. A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction. IEEE Trans Nanobioscience 2016; 15:75-83. [PMID: 26849871 DOI: 10.1109/tnb.2016.2522400] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/06/2022]
Abstract
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use the abundant labeled data from a source domain together with the limited labeled data and, optionally, the unlabeled data from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
|
35
|
Meher PK, Sahu TK, Rao AR. Prediction of donor splice sites using random forest with a new sequence encoding approach. BioData Min 2016; 9:4. [PMID: 26807151 PMCID: PMC4724119 DOI: 10.1186/s13040-016-0086-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 03/02/2015] [Accepted: 01/19/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Detection of splice sites plays a key role in predicting gene structure, and thus the development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on adjacent di-nucleotide dependencies, in which the donor splice site motifs are encoded into numeric vectors. The encoded vectors are then used as input to Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), Bagging, Boosting, Logistic regression, kNN and Naïve Bayes classifiers for prediction of donor splice sites. RESULTS The performance of the proposed approach is evaluated on the donor splice site sequence data of Homo sapiens, collected from the Homo Sapiens Splice Sites Dataset (HS3D). The results showed that RF outperformed all the other classifiers considered. In addition, RF achieved higher prediction accuracy than the existing methods, viz. MEM, MDD, WMM, MM1, NNSplice and SpliceView, when compared using an independent test dataset. CONCLUSION Based on the proposed approach, we have developed an online prediction server (MaLDoSS) to help the biological community in predicting donor splice sites. The server is made freely available at http://cabgrid.res.in:8080/maldoss. Owing to its computational feasibility and high prediction accuracy, the proposed approach is believed to help in predicting eukaryotic gene structure.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110 012 India
- Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110 012 India
- Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110 012 India
|
36
|
A Comprehensive Review of Emerging Computational Methods for Gene Identification. JOURNAL OF INFORMATION PROCESSING SYSTEMS 2016. [DOI: 10.3745/jips.04.0023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/03/2022]
|
37
|
Vidovic MMC, Görnitz N, Müller KR, Rätsch G, Kloft M. SVM2Motif--Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor. PLoS One 2015; 10:e0144782. [PMID: 26690911 PMCID: PMC4686957 DOI: 10.1371/journal.pone.0144782] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/28/2015] [Accepted: 11/22/2015] [Indexed: 12/02/2022] Open
Abstract
Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks but, owing to their black-box character, the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although they are a major step towards the explanation of trained SVM models, they suffer from the fact that their size grows exponentially with the length of the motif, which renders their manual inspection feasible only for comparably small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs, regardless of their length and complexity, underlying the predictions of a trained SVM model. Our framework thereby considers the motifs as free parameters in a probabilistic model, a task which can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address by an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.
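The exponential growth that limits manual inspection to about k ≤ 5 is easy to quantify. The sketch below assumes, for illustration only, that a POIM holds one entry per DNA k-mer and per start position; the sequence length of 100 is hypothetical, and the exact indexing used in the paper may differ.

```python
def poim_entries(k, seq_len, alphabet_size=4):
    """Entries in a table with one cell per (k-mer, start position).

    Assumption for illustration: positions run over every start index
    at which a full k-mer fits into a sequence of length seq_len.
    """
    return alphabet_size ** k * (seq_len - k + 1)


for k in (1, 3, 5, 8, 12):
    print(k, poim_entries(k, seq_len=100))
```

Each extra nucleotide multiplies the number of k-mers by four, so a table that is still readable at k = 5 becomes unmanageable well before motif lengths of practical interest.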
Affiliation(s)
- Nico Görnitz
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
- Klaus-Robert Müller
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
- Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136–713, Korea
- Gunnar Rätsch
- Memorial Sloan-Kettering Cancer Center, New York City, New York, United States of America
- Marius Kloft
- Department of Computer Science, Humboldt University of Berlin, Berlin, Germany
|
38
|
Kabir M, Iqbal M, Ahmad S, Hayat M. iTIS-PseKNC: Identification of Translation Initiation Site in human genes using pseudo k-tuple nucleotides composition. Comput Biol Med 2015; 66:252-7. [PMID: 26433457 DOI: 10.1016/j.compbiomed.2015.09.010] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 09/13/2015] [Accepted: 09/14/2015] [Indexed: 10/23/2022]
Abstract
Translation is an essential genetic process for understanding the mechanism of gene expression. Due to the large number of protein sequences generated in the post-genomic era, conventional methods are unable to identify the Translation Initiation Site (TIS) in human genes in a timely and accurate manner. It is thus highly desirable to develop an automatic and accurate computational model for identification of TIS. Considerable improvements have been achieved in developing computational models; however, development of accurate and reliable automated systems for TIS identification in human genes is still a challenging task. In this connection, we propose iTIS-PseKNC, a novel protocol for identification of TIS. Three protein sequence representation methods, namely dinucleotide composition, pseudo-dinucleotide composition and trinucleotide composition, have been used to extract numerical descriptors. Support Vector Machine (SVM), K-nearest neighbor and Probabilistic Neural Network classifiers are assessed for their performance using the constructed descriptors. The proposed model iTIS-PseKNC achieved 99.40% accuracy using the jackknife test. The experimental results validated the superior performance of iTIS-PseKNC over the existing methods reported in the literature. It is highly anticipated that the iTIS-PseKNC predictor will be useful for basic research studies.
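Two of the descriptors named above, dinucleotide and trinucleotide composition, are normalized k-mer frequency vectors of length 16 and 64. The sketch below is a generic illustration rather than the authors' code; the function name is an assumption, and the pseudo-dinucleotide composition is not covered.

```python
from itertools import product


def kmer_composition(seq, k, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    k=2 gives a 16-dimensional dinucleotide composition and k=3 a
    64-dimensional trinucleotide composition. Windows containing symbols
    outside the alphabet (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


dinuc = kmer_composition("GCCACCATGGCG", 2)
trinuc = kmer_composition("GCCACCATGGCG", 3)
print(len(dinuc), len(trinuc))
```

Vectors of this kind can be passed directly to classifiers such as an SVM or k-nearest neighbor.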
Affiliation(s)
- Muhammad Kabir
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Saeed Ahmad
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
|
39
|
Abstract
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Affiliation(s)
- Maxwell W Libbrecht
- Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA
- William Stafford Noble
- Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA
- Department of Genome Sciences, University of Washington, 3720 15th Ave NE Seattle, Washington 98195-5065, USA
|
40
|
Kumar R, Srivastava A, Kumari B, Kumar M. Prediction of β-lactamase and its class by Chou’s pseudo-amino acid composition and support vector machine. J Theor Biol 2015; 365:96-103. [DOI: 10.1016/j.jtbi.2014.10.008] [Citation(s) in RCA: 125] [Impact Index Per Article: 13.9] [Received: 08/05/2014] [Revised: 10/01/2014] [Accepted: 10/06/2014] [Indexed: 01/01/2023]
|
41
|
An improved poly(A) motifs recognition method based on decision level fusion. Comput Biol Chem 2014; 54:49-56. [PMID: 25594576 DOI: 10.1016/j.compbiolchem.2014.12.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 07/16/2014] [Revised: 11/27/2014] [Accepted: 12/27/2014] [Indexed: 01/07/2023]
Abstract
Polyadenylation is the process of adding a poly(A) tail to mRNA 3' ends. Identifying the motifs that control polyadenylation plays an essential role in improving genome annotation accuracy and in better understanding the mechanisms governing gene regulation. Bioinformatics methods for poly(A) motif recognition have demonstrated that information extracted from the sequences surrounding candidate motifs can effectively differentiate true motifs from false ones. However, these methods depend on either domain features or string kernels, and methods that combine information from different sources have not yet been reported. Here, we propose an improved poly(A) motif recognition method that combines different sources through decision-level fusion. First, two novel prediction methods based on the support vector machine (SVM) were proposed: one uses domain-specific features together with principal component analysis (PCA) to eliminate redundancy (PCA-SVM); the other is based on the Oligo string kernel (Oligo-SVM). We then proposed a novel machine-learning method for poly(A) motif prediction that combines four poly(A) motif recognition methods: two state-of-the-art methods (Random Forest (RF) and HMM-SVM) and the two newly proposed methods (PCA-SVM and Oligo-SVM). A decision-level information fusion method was employed to combine the decision values of the different classifiers by applying Dempster-Shafer (DS) evidence theory. We evaluated our method on a comprehensive poly(A) dataset that consists of 14,740 samples covering 12 variants of poly(A) motifs and 2750 samples containing none of these motifs. Our method achieved an accuracy of up to 86.13%. Compared with the four classifiers, our evidence-theory-based method reduces the average error rate by about 30%, 27%, 26% and 16%, respectively. The experimental results suggest that the proposed method is more effective for poly(A) motif recognition.
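To make the decision-level fusion step concrete, the following is a minimal sketch of Dempster's rule of combination for two classifiers. It is not the authors' implementation: the hypothesis labels, the mass values, and the function name `combine` are all illustrative assumptions, and how classifier decision values are turned into masses is not specified in the abstract.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass,
    and the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreement: the product mass goes to the intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint hypotheses: the product mass counts as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    # Renormalise by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical frame of discernment for one candidate site.
TRUE = frozenset({"motif"})
FALSE = frozenset({"not_motif"})
EITHER = TRUE | FALSE  # mass left on "don't know"

# Hypothetical masses from two classifiers (assumed values, not from the paper).
clf_a = {TRUE: 0.6, FALSE: 0.1, EITHER: 0.3}
clf_b = {TRUE: 0.5, FALSE: 0.2, EITHER: 0.3}

fused = combine(clf_a, clf_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 3))
```

Because the rule is associative, more than two classifiers can be fused by applying `combine` repeatedly, for example with `functools.reduce`.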
|
42
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data. BMC Bioinformatics 2014; 15:362. [PMID: 25420551 PMCID: PMC4702320 DOI: 10.1186/s12859-014-0362-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 01/01/2014] [Accepted: 10/24/2014] [Indexed: 11/17/2022] Open
Abstract
Background: Most approaches for splice site prediction are based on machine learning techniques. Although these approaches provide high prediction accuracy, they rely on long sequence windows. Hence, they may not be suitable for predicting novel splice variants from the short sequence reads generated by next-generation sequencing technologies. Further, machine learning techniques require numerically encoded data and yield different accuracies with different encoding procedures. Therefore, splice site prediction with short sequence motifs and without encoding sequence data became the motivation for the present study. Results: An approach for finding associations among nucleotide bases in splice site motifs is developed and used to determine the appropriate window size. In addition, an approach for predicting donor splice sites using the sum of absolute error criterion is proposed. The proposed approach was compared with commonly used approaches, i.e., Maximum Entropy Modeling (MEM), Maximal Dependency Decomposition (MDD), Weighted Matrix Method (WMM) and the first-order Markov Model (MM1), and was found to perform on par with MEM and MDD and better than WMM and MM1 in terms of prediction accuracy. Conclusions: The proposed approach can predict donor splice sites with higher accuracy using short sequence motifs and hence can be used as a complementary method to existing approaches. Based on the proposed methodology, a web server was also developed for easy prediction of donor splice sites and is available at http://cabgrid.res.in:8080/sspred. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0362-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Sant Dass Wahi
- Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
|
43
|
Chen W, Feng PM, Deng EZ, Lin H, Chou KC. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal Biochem 2014; 462:76-83. [PMID: 25016190 DOI: 10.1016/j.ab.2014.06.022] [Citation(s) in RCA: 218] [Impact Index Per Article: 21.8] [Received: 05/20/2014] [Revised: 06/26/2014] [Accepted: 06/27/2014] [Indexed: 01/25/2023]
Abstract
Translation is a key process for gene expression. Timely identification of the translation initiation site (TIS) is very important for conducting in-depth genome analysis. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying TIS. Although some computational methods have been proposed in this regard, none of them considered the global or long-range sequence-order effects of DNA, and hence their prediction quality was limited. To account for such effects, a new predictor, called "iTIS-PseTNC," was developed by incorporating physicochemical properties into the pseudo trinucleotide composition, much like the PseAAC (pseudo amino acid composition) approach widely used in computational proteomics. A rigorous cross-validation test on the benchmark dataset showed that the overall success rate achieved by the new predictor in identifying TIS locations was over 97%. As a web server, iTIS-PseTNC is freely accessible at http://lin.uestc.edu.cn/server/iTIS-PseTNC. To maximize the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web server to obtain the desired results without the need to go through detailed mathematical equations, which are presented in this paper just for the integrity of the new prediction method.
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China; Gordon Life Science Institute, Boston, MA 02478, USA.
- Peng-Mian Feng
- School of Public Health, Hebei United University, Tangshan 063000, China.
- En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Hao Lin
- Gordon Life Science Institute, Boston, MA 02478, USA; Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Kuo-Chen Chou
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China; Gordon Life Science Institute, Boston, MA 02478, USA; Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
|
44
|
Pérez-Rodríguez J, Arroyo-Peña AG, García-Pedrajas N. Improving translation initiation site and stop codon recognition by using more than two classes. Bioinformatics 2014; 30:2702-8. [PMID: 24903421 DOI: 10.1093/bioinformatics/btu369] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The recognition of translation initiation sites and stop codons is a fundamental part of any gene recognition program. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. These methods all use two classes, one of positive instances and another of negative instances that are constructed using sequences from the whole genome. However, the features of the negative sequences differ depending on the position of the negative samples in the gene: they vary according to whether the samples come from exons, introns, intergenic regions or any other functional part of the genome. Thus, the positive class is fairly homogeneous, as all its sequences come from the same part of the gene, but the negative class is heterogeneous, and classifier performance suffers as a result. In this article, we propose training different classifiers with different, more homogeneous negative classes and combining these classifiers for improved accuracy. RESULTS: The proposed method achieves better accuracy than the best state-of-the-art method, both in terms of the geometric mean of the specificity and sensitivity and the areas under the receiver operating characteristic and precision-recall curves. The method is tested on the whole human genome. The results for recognizing both translation initiation sites and stop codons indicated improvements in the rates of both false-negative results (FN) and false-positive results (FP). On average, for translation initiation site recognition, the FN ratio was reduced by 30.2% and the FP ratio decreased by 10.9%. For stop codon prediction, FP were reduced by 41.4% and FN by 31.7%. AVAILABILITY AND IMPLEMENTATION: The source code is licensed under the General Public License and is thus freely available. The datasets and source code can be obtained from http://cib.uco.es/site-recognition. CONTACT: npedrajas@uco.es.
Affiliation(s)
- Javier Pérez-Rodríguez
- Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Edificio Einstein, Planta 3, 14071 Córdoba, Spain
- Alexis G Arroyo-Peña
- Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Edificio Einstein, Planta 3, 14071 Córdoba, Spain
- Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Edificio Einstein, Planta 3, 14071 Córdoba, Spain
|
45
|
Dameh TA, Abd-Almageed W, Hefeeda M. Distributed Kernel Matrix Approximation and Implementation Using Message Passing Interface. 2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS 2013. [DOI: 10.1109/icmla.2013.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
|
46
|
Xie HL, Fu L, Nie XD. Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC. Protein Eng Des Sel 2013; 26:735-42. [PMID: 24048266 DOI: 10.1093/protein/gzt042] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Indexed: 01/22/2023] Open
Abstract
As the most frequent drug target, G-protein coupled receptors (GPCRs) are a large family of seven-transmembrane receptors that sense molecules outside the cell and activate signal transduction pathways inside it. Glycosylation is one of the most complex post-translational modifications (PTMs) of proteins in eukaryotic cells. It plays important roles in a variety of cellular functions, including protein folding, protein trafficking and localization, cell-cell interactions and epitope recognition. Therefore, determining the exact positions of glycosylation sites in GPCR sequences can provide useful clues for drug design and other biotechnology applications. Experimental identification of glycosylation sites is expensive and laborious. Hence, there is significant interest in the development of computational methods for reliable prediction of glycosylation sites from amino acid sequences. In this article, we present an effective method to recognize the N-linked glycosylation sites of human GPCRs by combining amino acid hydrophobicity with an ensemble support vector machine. The prediction accuracy, sensitivity, specificity, Matthews correlation coefficient and area under the curve values were 94.4%, 89.7%, 98.9%, 0.895 and 0.989, respectively. The establishment of such a fast and accurate prediction method will speed up the identification of functional sites in GPCRs to facilitate drug discovery.
Affiliation(s)
- Hua-Lin Xie
- School of Chemistry and Chemical Engineering, Central South University, Changsha 410083, People's Republic of China
|
47
|
Chang CCH, Song J, Tey BT, Ramanan RN. Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Brief Bioinform 2013; 15:953-62. [DOI: 10.1093/bib/bbt057] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022] Open
|
48
|
McEachern A, Ashlock D, Schonfeld J. Sequence classification with side effect machines evolved via ring optimization. Biosystems 2013; 113:9-27. [PMID: 23603215 DOI: 10.1016/j.biosystems.2013.03.022] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/08/2011] [Revised: 03/29/2013] [Accepted: 03/31/2013] [Indexed: 10/26/2022]
Abstract
The explosion of available sequence data necessitates the development of sophisticated machine learning tools with which to analyze them. This study introduces a sequence-learning technology called side effect machines. It also applies a model of evolution, which simulates the evolution of a ring species, to the training of the side effect machines. Side effect machines evolved in the ring structure are compared with side effect machines evolved using a standard evolutionary algorithm based on tournament selection. At the core of the training of side effect machines is a nearest neighbor classifier. A parameter study was performed to investigate the impact of dividing the training data into examples for nearest neighbor assessment and training cases. The parameter study demonstrates that parameter setting is important in the baseline runs but has little impact in the ring-optimization runs. The ring optimization technique was also found to exhibit improved and more reliable training performance. Side effect machines are tested on two types of synthetic data, one based on GC-content and the other checking the ability of side effect machines to recognize an embedded motif. Three types of biological data are used: a data set with different types of immune-system genes, a data set with normal and retro-virally derived human genomic sequence, and standard and nonstandard initiation regions from cytochrome oxidase subunit one in the mitochondrial genome.
Affiliation(s)
- Andrew McEachern
- Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, Canada N1G 2W1.
|
49
|
Xia X. Position weight matrix, gibbs sampler, and the associated significance tests in motif characterization and prediction. SCIENTIFICA 2012; 2012:917540. [PMID: 24278755 PMCID: PMC3820676 DOI: 10.6064/2012/917540] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 08/22/2012] [Accepted: 10/11/2012] [Indexed: 05/31/2023]
Abstract
Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
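As a concrete illustration of the PWM and PWM score (PWMS) discussed in this abstract, the sketch below builds a log-odds PWM from aligned sites with pseudocounts and scans a sequence for the best-scoring window. It is a minimal example under stated assumptions, not the paper's procedure: the uniform background frequencies, the pseudocount of 1, the example sites and all function names are hypothetical, and no significance test is included.

```python
import math

BASES = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform background
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        # Pseudocounts keep unseen bases from producing log(0).
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def score(pwm, segment):
    """PWM score (PWMS) of one segment: the sum of per-position log-odds."""
    return sum(col[base] for col, base in zip(pwm, segment))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    windows = [(score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1)]
    return max(windows)


# Hypothetical aligned sites and target sequence, for illustration only.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
hit_score, hit_start = best_hit(pwm, "GGCGTATAATCGC")
print(hit_start, round(hit_score, 2))
```

A raw PWMS like `hit_score` says nothing about significance on its own; the abstract's point is that each score should additionally be assessed, for example against an extreme value distribution.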
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa, ON, Canada K1N 6N5
|
50
|
Li JL, Wang LF, Wang HY, Bai LY, Yuan ZM. High-accuracy splice site prediction based on sequence component and position features. GENETICS AND MOLECULAR RESEARCH 2012; 11:3432-51. [PMID: 23079837 DOI: 10.4238/2012.september.25.12] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 11/03/2022]
Abstract
Identification of splice sites plays a key role in the annotation of genes. Consequently, improvement of computational prediction of splice sites would be very useful. We examined the effect of the window size and the number and position of the consensus bases with a chi-square test, and then extracted the sequence multi-scale component features and the position and adjacent position relationship features of consensus sites. Then, we constructed a novel classification model using a support vector machine with the previously selected features and applied it to the Homo sapiens splice site dataset. This method greatly improved cross-validation accuracies for training sets with true and spurious splice sites of both equal and different proportions. This method was also applied to the NN269 dataset for further evaluation and independent testing. The results were superior to those obtained with previous methods, and demonstrate the stability and superiority of this method for prediction of splice sites.
Affiliation(s)
- J L Li
- Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, China
|