1
Yom A, Chiang A, Lewis NE. Boltzmann Model Predicts Glycan Structures from Lectin Binding. Anal Chem 2024; 96:8332-8341. [PMID: 38720429 PMCID: PMC11162346 DOI: 10.1021/acs.analchem.3c04992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2024]
Abstract
Glycans are complex oligosaccharides that are involved in many diseases and biological processes. Unfortunately, current methods for determining glycan composition and structure (glycan sequencing) are laborious and require a high level of expertise. Here, we assess the feasibility of sequencing glycans based on their lectin binding fingerprints. By training a Boltzmann model on lectin binding data, we predict the approximate structures of 88 ± 7% of N-glycans and 87 ± 13% of O-glycans in our test set. We show that our model generalizes well to the pharmaceutically relevant case of Chinese hamster ovary (CHO) cell glycans. We also analyze the motif specificity of a wide array of lectins and identify the most and least predictive lectins and glycan features. These results could help streamline glycoprotein research and be of use to anyone using lectins for glycobiology.
Affiliation(s)
- Aria Yom
  - Department of Physics, University of California, San Diego, California 92093, United States
- Austin Chiang
  - Department of Pediatrics, University of California, San Diego, California 92093, United States
  - Immunology Center of Georgia, Augusta University, Augusta, Georgia 30912, United States
  - Department of Medicine, Augusta University, Augusta, Georgia 30912, United States
- Nathan E Lewis
  - Department of Pediatrics, University of California, San Diego, California 92093, United States
  - Department of Bioengineering, University of California, San Diego, California 92093, United States
2
Abstract
Artificial intelligence (AI) methods have been and are now being increasingly integrated in prediction software implemented in bioinformatics and its glycoscience branch known as glycoinformatics. AI techniques have evolved in the past decades, and their applications in glycoscience are not yet widespread. This limited use is partly explained by the peculiarities of glyco-data that are notoriously hard to produce and analyze. Nonetheless, as time goes on, the accumulation of glycomics, glycoproteomics, and glycan-binding data has reached a point where even the most recent deep learning methods can provide predictors with good performance. We discuss the historical development of the application of various AI methods in the broader field of glycoinformatics. A particular focus is placed on shining a light on challenges in glyco-data handling, contextualized by lessons learnt from related disciplines. Ending on the discussion of state-of-the-art deep learning approaches in glycoinformatics, we also envision the future of glycoinformatics, including developments that need to occur in order to truly unleash the capabilities of glycoscience in the systems biology era.
Affiliation(s)
- Daniel Bojar
  - Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg 41390, Sweden
  - Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg 41390, Sweden
- Frederique Lisacek
  - Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
  - Computer Science Department & Section of Biology, University of Geneva, route de Drize 7, CH-1227, Geneva, Switzerland
3
Carpenter EJ, Seth S, Yue N, Greiner R, Derda R. GlyNet: a multi-task neural network for predicting protein-glycan interactions. Chem Sci 2022; 13:6669-6686. [PMID: 35756507 PMCID: PMC9172296 DOI: 10.1039/d1sc05681f] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/15/2021] [Accepted: 05/02/2022] [Indexed: 12/14/2022] Open
Abstract
Advances in diagnostics, therapeutics, vaccines, transfusion, and organ transplantation build on a fundamental understanding of glycan-protein interactions. To aid this, we developed GlyNet, a model that accurately predicts interactions (relative binding strengths) between mammalian glycans and 352 glycan-binding proteins, many at multiple concentrations. For each glycan input, our model produces 1257 outputs, each representing the relative interaction strength between the input glycan and a particular protein sample. GlyNet learns these continuous values using relative fluorescence units (RFUs) measured on 599 glycans in the Consortium for Functional Glycomics glycan arrays and extrapolates these to RFUs from additional, untested glycans. GlyNet's output of continuous values provides more detailed results than the standard binary classification models. After incorporating a simple threshold to transform such continuous outputs, the resulting GlyNet classifier outperforms those standard classifiers. GlyNet is the first multi-output regression model for predicting protein-glycan interactions and serves as an important benchmark, facilitating development of quantitative computational glycobiology.
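The thresholding step mentioned in this abstract can be illustrated with a minimal sketch. This is not GlyNet's code: the function name `to_binary_calls`, the example scores, and the cutoff value are all hypothetical, since the abstract does not state which threshold was used.

```python
from typing import List


def to_binary_calls(scores: List[float], threshold: float) -> List[int]:
    """Turn continuous binding scores into binder (1) / non-binder (0) calls."""
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical predicted scores for one glycan against four protein samples.
predicted = [0.10, 2.75, 0.90, 3.40]

# Hypothetical cutoff; a real one would be chosen on held-out data.
calls = to_binary_calls(predicted, threshold=1.0)
print(calls)  # [0, 1, 0, 1]
```

Keeping the regression output and applying the cutoff afterwards means the same model can be re-thresholded for different sensitivity requirements without retraining.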
Affiliation(s)
- Eric J Carpenter
  - Department of Chemistry, University of Alberta, Edmonton, Alberta, Canada
- Shaurya Seth
  - Department of Chemistry, University of Alberta, Edmonton, Alberta, Canada
- Noel Yue
  - Department of Chemistry, University of Alberta, Edmonton, Alberta, Canada
- Russell Greiner
  - Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
  - Alberta Machine Intelligence Institute (AMII), Edmonton, Alberta, Canada
- Ratmir Derda
  - Department of Chemistry, University of Alberta, Edmonton, Alberta, Canada
4
Flevaris K, Kontoravdi C. Immunoglobulin G N-glycan Biomarkers for Autoimmune Diseases: Current State and a Glycoinformatics Perspective. Int J Mol Sci 2022; 23:5180. [PMID: 35563570 PMCID: PMC9100869 DOI: 10.3390/ijms23095180] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 02/28/2022] [Revised: 05/02/2022] [Accepted: 05/04/2022] [Indexed: 02/04/2023] Open
Abstract
The effective treatment of autoimmune disorders can greatly benefit from disease-specific biomarkers that are functionally involved in immune system regulation and can be collected through minimally invasive procedures. In this regard, human serum IgG N-glycans are promising for uncovering disease predisposition and monitoring progression, and for the identification of specific molecular targets for advanced therapies. In particular, the IgG N-glycome in diseased tissues is considered to be disease-dependent; thus, specific glycan structures may be involved in the pathophysiology of autoimmune diseases. This study provides a critical overview of the literature on human IgG N-glycomics, with a focus on the identification of disease-specific glycan alterations. In order to expedite the establishment of clinically-relevant N-glycan biomarkers, the employment of advanced computational tools for the interpretation of clinical data and their relationship with the underlying molecular mechanisms may be critical. Glycoinformatics tools, including artificial intelligence and systems glycobiology approaches, are reviewed for their potential to provide insight into patient stratification and disease etiology. Challenges in the integration of such glycoinformatics approaches in N-glycan biomarker research are critically discussed.
Affiliation(s)
- Cleo Kontoravdi
  - Department of Chemical Engineering, Imperial College London, London SW7 2AZ, UK
5
A vectorial tree distance measure. Sci Rep 2022; 12:5256. [PMID: 35347186 PMCID: PMC8960910 DOI: 10.1038/s41598-022-08360-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2021] [Accepted: 02/28/2022] [Indexed: 11/08/2022] Open
Abstract
A vectorial distance measure for trees is presented. Given two trees, we define a Tree-Alignment (T-Alignment). We T-align the trees from their centers outwards, starting from the root-branches, to make the next level as similar as possible. The algorithm is recursive: conditioned on the T-alignment of the root-branches, we T-align the sub-branches; thereafter, each T-alignment is conditioned on the previous one. We define a minimal T-alignment under a lexicographic order, which follows the intuition that the differences between the two trees constitute a vector. Given such a minimal T-alignment, the difference in the number of branches calculated at any level defines the entry of the distance vector at that level. We compare our algorithm to other well-known tree distance measures in the task of clustering sets of phylogenetic trees. We use the TreeSimGM simulator for generating stochastic phylogenetic trees. The vectorial tree distance (VTD) can successfully separate symmetric from asymmetric trees, and hierarchical from non-hierarchical trees. We also test the algorithm as a classifier of phylogenetic trees extracted from two members of the fungi kingdom, mushrooms and mildews, thus showing that the algorithm can separate real-world phylogenetic trees. The Matlab code can be accessed via: https://gitlab.com/avner.priel/vectorial-tree-distance .
6
Mohapatra S, An J, Gómez-Bombarelli R. Chemistry-informed macromolecule graph representation for similarity computation, unsupervised and supervised learning. Machine Learning: Science and Technology 2022. [DOI: 10.1088/2632-2153/ac545e] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022] Open
Abstract
The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymer topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed a chemistry-informed graph representation of macromolecules that enables quantifying structural similarity, and interpretable supervised learning for macromolecules. Our work enables quantitative chemistry-informed decision-making and iterative design in the macromolecular chemical space.
7
Coff L, Chan J, Ramsland PA, Guy AJ. Identifying glycan motifs using a novel subtree mining approach. BMC Bioinformatics 2020; 21:42. [PMID: 32019496 PMCID: PMC7001330 DOI: 10.1186/s12859-020-3374-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/22/2019] [Accepted: 01/20/2020] [Indexed: 11/17/2022] Open
Abstract
Background: Glycans are complex sugar chains, crucial to many biological processes. By participating in binding interactions with proteins, glycans often play key roles in host–pathogen interactions. The specificities of glycan-binding proteins, such as lectins and antibodies, are governed by motifs within larger glycan structures, and improved characterisations of these determinants would aid research into human diseases. Identification of motifs has previously been approached as a frequent subtree mining problem, and we extend these approaches with a glycan notation that allows recognition of terminal motifs.
Results: In this work, we customised a frequent subtree mining approach by altering the glycan notation to include information on terminal connections. This allows specific identification of terminal residues as potential motifs, better capturing the complexity of glycan-binding interactions. We achieved this by including additional nodes in a graph representation of the glycan structure to indicate the presence or absence of a linkage at particular backbone carbon positions. Combining this frequent subtree mining approach with a state-of-the-art feature selection algorithm termed minimum-redundancy, maximum-relevance (mRMR), we have generated a classification pipeline that is trained on data from a glycan microarray. When applied to a set of commonly used lectins, the identified motifs were consistent with known binding determinants. Furthermore, logistic regression classifiers trained using these motifs performed well across most lectins examined, with a median AUC value of 0.89.
Conclusions: We present here a new subtree mining approach for the classification of glycan binding and identification of potential binding motifs. The Carbohydrate Classification Accounting for Restricted Linkages (CCARL) method will assist in the interpretation of glycan microarray experiments and will aid in the discovery of novel binding motifs for further experimental characterisation.
Affiliation(s)
- Lachlan Coff
  - School of Science, College of Science, Engineering and Health, RMIT University, 3000, Melbourne, Australia
- Jeffrey Chan
  - School of Science, College of Science, Engineering and Health, RMIT University, 3000, Melbourne, Australia
- Paul A Ramsland
  - School of Science, College of Science, Engineering and Health, RMIT University, 3000, Melbourne, Australia
  - Department of Immunology, Monash University, 3004, Melbourne, Australia
  - Department of Surgery Austin Health, University of Melbourne, 3084, Heidelberg, Australia
- Andrew J Guy
  - School of Science, College of Science, Engineering and Health, RMIT University, 3000, Melbourne, Australia
8
Haab BB, Klamer Z. Advances in Tools to Determine the Glycan-Binding Specificities of Lectins and Antibodies. Mol Cell Proteomics 2020; 19:224-232. [PMID: 31848260 PMCID: PMC7000120 DOI: 10.1074/mcp.r119.001836] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/23/2019] [Revised: 12/13/2019] [Indexed: 01/17/2023] Open
Abstract
Proteins that bind carbohydrate structures can serve as tools to quantify or localize specific glycans in biological specimens. Such proteins, including lectins and glycan-binding antibodies, are particularly valuable if accurate information is available about the glycans that a protein binds. Glycan arrays have been transformational for uncovering rich information about the nuances and complexities of glycan-binding specificity. A challenge, however, has been the analysis of the data. Because protein-glycan interactions are so complex, simplistic modes of analyzing the data and describing glycan-binding specificities have proven inadequate in many cases. This review surveys the methods for handling high-content data on protein-glycan interactions. We contrast the approaches that have been demonstrated and provide an overview of the resources that are available. We also give an outlook on the promising experimental technologies for generating new insights into protein-glycan interactions, as well as a perspective on the limitations that currently face the field.
9
Akiyoshi S, Iwata M, Berenger F, Yamanishi Y. Omics-based Identification of Glycan Structures as Biomarkers for a Variety of Diseases. Mol Inform 2019; 39:e1900112. [PMID: 31622036 DOI: 10.1002/minf.201900112] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 08/20/2019] [Accepted: 09/24/2019] [Indexed: 12/11/2022]
Abstract
Glycans play important roles in cell communication, protein interaction, and immunity, and structural changes in glycans are associated with the regulation of a range of biological pathways involved in disease. However, our understanding of the detailed relationships between specific diseases and glycans is very limited. In this study, we proposed an omics-based method to investigate the correlations between glycans and a wide range of human diseases. We analyzed the gene expression patterns of glycogenes (glycosyltransferases and glycosidases) for 79 different diseases. A biological pathway-based glycogene signature was constructed to identify the alteration in glycan biosynthesis and the associated glycan structures for each disease state. The degradation of N-glycan and keratan sulfate, for example, may promote the growth or metastasis of multiple types of cancer, including endometrial, gastric, and nasopharyngeal. Our results also revealed that commonalities between diseases can be interpreted using glycogene expression patterns, as well as the associated glycan structure patterns at the level of the affected pathway. The proposed method is expected to be useful for understanding the relationships between glycans, glycogenes, and disease and identifying disease-specific glycan biomarkers.
Affiliation(s)
- Sayaka Akiyoshi
  - Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka, 812-8582, Japan
- Michio Iwata
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
- Francois Berenger
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
- Yoshihiro Yamanishi
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
10
Hosoda M, Akune Y, Aoki-Kinoshita KF. Development and application of an algorithm to compute weighted multiple glycan alignments. Bioinformatics 2017; 33:1317-1323. [PMID: 28093404 PMCID: PMC5408794 DOI: 10.1093/bioinformatics/btw827] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 09/15/2016] [Revised: 12/22/2016] [Accepted: 01/10/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation: A glycan consists of monosaccharides linked by glycosidic bonds, has branches and forms complex molecular structures. Databases have been developed to store large amounts of glycan-binding experiments, including glycan arrays with glycan-binding proteins. However, there are few bioinformatics techniques to analyze large amounts of data for glycans because there are few tools that can handle the complexity of glycan structures. Thus, we have developed the MCAW (Multiple Carbohydrate Alignment with Weights) tool that can align multiple glycan structures, to aid in the understanding of their function as binding recognition molecules.
Results: We have described in detail the first algorithm to perform multiple glycan alignments by modeling glycans as trees. To test our tool, we prepared several data sets, and as a result, we found that the glycan motif could be successfully aligned without any prior knowledge applied to the tool, and the known recognition binding sites of glycans could be aligned at a high rate amongst all our datasets tested. We thus claim that our tool is able to find meaningful glycan recognition and binding patterns using data obtained by glycan-binding experiments. The development and availability of an effective multiple glycan alignment tool opens possibilities for many other glycoinformatics analyses, making this work a big step towards furthering glycomics analysis.
Availability and Implementation: http://www.rings.t.soka.ac.jp.
Contact: kkiyoko@soka.ac.jp.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Masae Hosoda
  - Department of Bioinformatics, Graduate School of Engineering, Soka University, Tokyo, Japan
- Yukie Akune
  - Department of Bioinformatics, Graduate School of Engineering, Soka University, Tokyo, Japan
11
Takigawa I, Mamitsuka H. Generalized Sparse Learning of Linear Models Over the Complete Subgraph Feature Set. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017; 39:617-624. [PMID: 27187949 DOI: 10.1109/tpami.2016.2567399] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Supervised learning over graphs is an intrinsically difficult problem: simultaneous learning of relevant features from the complete subgraph feature set, in which enumerating all subgraph features occurring in given graphs is practically intractable due to combinatorial explosion. We show that 1) existing graph supervised learning studies, such as Adaboost, LPBoost, and LARS/LASSO, can be viewed as variations of a branch-and-bound algorithm with simple bounds, which we call Morishita-Kudo bounds; 2) We present a direct sparse optimization algorithm for generalized problems with arbitrary twice-differentiable loss functions, to which Morishita-Kudo bounds cannot be directly applied; 3) We experimentally showed that i) our direct optimization method improves the convergence rate and stability, and ii) L1-penalized logistic regression (L1-LogReg) by our method identifies a smaller subgraph set, keeping the competitive performance, iii) the learned subgraphs by L1-LogReg are more size-balanced than competing methods, which are biased to small-sized subgraphs.
12
Bennun SV, Hizal DB, Heffner K, Can O, Zhang H, Betenbaugh MJ. Systems Glycobiology: Integrating Glycogenomics, Glycoproteomics, Glycomics, and Other ‘Omics Data Sets to Characterize Cellular Glycosylation Processes. J Mol Biol 2016; 428:3337-3352. [PMID: 27423401 DOI: 10.1016/j.jmb.2016.07.005] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 02/06/2016] [Revised: 07/05/2016] [Accepted: 07/07/2016] [Indexed: 12/17/2022]
13
Shen D, Shen H, Bhamidi S, Maldonado YM, Kim Y, Marron JS. Functional Data Analysis of Tree Data Objects. J Comput Graph Stat 2014; 23:418-438. [PMID: 25346588 DOI: 10.1080/10618600.2013.786943] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 10/26/2022]
Abstract
Data analysis on non-Euclidean spaces, such as tree spaces, can be challenging. The main contribution of this paper is establishment of a connection between tree data spaces and the well developed area of Functional Data Analysis (FDA), where the data objects are curves. This connection comes through two tree representation approaches, the Dyck path representation and the branch length representation. These representations of trees in Euclidean spaces enable us to exploit the power of FDA to explore statistical properties of tree data objects. A major challenge in the analysis is the sparsity of tree branches in a sample of trees. We overcome this issue by using a tree pruning technique that focuses the analysis on important underlying population structures. This method parallels scale-space analysis in the sense that it reveals statistical properties of tree structured data over a range of scales. The effectiveness of these new approaches is demonstrated by some novel results obtained in the analysis of brain artery trees. The scale space analysis reveals a deeper relationship between structure and age. These methods are the first to find a statistically significant gender difference.
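As an illustration of how a tree can be turned into a curve, the following is a minimal sketch of the generic Dyck path encoding: a depth-first walk that records the current depth, rising one unit when descending an edge and falling one unit when returning. It is not the authors' implementation; the dictionary-based tree encoding, the function name `dyck_path`, and the toy tree are assumptions made for the example.

```python
from typing import Dict, List

# Assumed encoding: each node maps to its ordered list of children.
Tree = Dict[str, List[str]]


def dyck_path(tree: Tree, root: str) -> List[int]:
    """Return the sequence of depths visited in a depth-first walk of the tree."""
    heights = [0]

    def walk(node: str, depth: int) -> None:
        for child in tree.get(node, []):
            heights.append(depth + 1)  # step down the edge to the child
            walk(child, depth + 1)
            heights.append(depth)      # step back up to the parent

    walk(root, 0)
    return heights


# Toy tree: r has children a and b; a has one child c.
toy_tree = {"r": ["a", "b"], "a": ["c"]}
print(dyck_path(toy_tree, "r"))  # [0, 1, 2, 1, 0, 1, 0]
```

A tree with n edges yields a path of 2n + 1 heights that starts and ends at zero, so a sample of trees becomes a sample of curves that curve-based methods can work on.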
Affiliation(s)
- Dan Shen
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
- Haipeng Shen
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
- Shankar Bhamidi
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
- Yongdai Kim
  - Department of Statistics, Seoul National University, South Korea
- J S Marron
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
14
Tang H, Mayampurath A, Yu C, Mechref Y. Bioinformatics Protocols in Glycomics and Glycoproteomics. Curr Protoc Protein Sci 2014; 76:2.15.1-2.15.7. [DOI: 10.1002/0471140864.ps0215s76] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/11/2022]
Affiliation(s)
- Haixu Tang
  - School of Informatics and Computing, Indiana University, Bloomington, Indiana
- Anoop Mayampurath
  - School of Informatics and Computing, Indiana University, Bloomington, Indiana
- Chuan‐Yih Yu
  - School of Informatics and Computing, Indiana University, Bloomington, Indiana
- Yehia Mechref
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, Texas
15
Sánchez-Rodríguez MI, Caridad JM. Modeling and partial least squares approaches in OODA. Biom J 2014; 56:771-3. [PMID: 24652826 DOI: 10.1002/bimj.201300178] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/04/2013] [Revised: 12/26/2013] [Accepted: 01/04/2014] [Indexed: 11/06/2022]
Abstract
This is a discussion of the following paper: "Overview of object oriented data analysis" by J. Steve Marron and Andrés M. Alonso.
Affiliation(s)
- José M Caridad
  - Department of Statistics, Econometrics and Business, University of Córdoba, Spain
16
Abstract
Background: The glycomics field has made great advancements in the last decade due to technologies for their synthesis and analysis including carbohydrate microarrays. Accordingly, databases for glycomics research have also emerged and been made publicly available by many major institutions worldwide.
Objective: This review introduces these and other useful databases on which new methods for drug discovery can be developed.
Methods: The scope of this review covers current documented and accessible databases and resources pertaining to glycomics. These were selected with the expectation that they may be useful for drug discovery research.
Results/Conclusion: There is a plethora of glycomics databases that have much potential for drug discovery. This may seem daunting at first but this review helps to put some of these resources into perspective. Additionally, some thoughts on how to integrate these resources to allow more efficient research are presented.
Affiliation(s)
- Kiyoko F Aoki-Kinoshita
  - Associate Professor, Department of Bioinformatics, Faculty of Engineering, Soka University, 1-236 Tangi-cho, Hachioji, Tokyo, 192-8577, Japan; +81 42 691 4116
17
18
Jiang H, Aoki-Kinoshita KF, Ching WK. Extracting glycan motifs using a biochemically-weighted kernel. Bioinformation 2011; 7:405-12. [PMID: 22347783 PMCID: PMC3280441 DOI: 10.6026/97320630007405] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 11/23/2011] [Accepted: 12/07/2011] [Indexed: 11/28/2022] Open
Abstract
Carbohydrates, or glycans, are among the most abundant and structurally diverse biopolymers and constitute the third major class of biomolecules, following DNA and proteins. However, the study of carbohydrate sugar chains has lagged behind that of DNA and proteins, mainly due to their inherent structural complexity. Their analysis is nonetheless important because they serve various important roles in biological processes, including signal transduction and cellular recognition. In order to shed some light on glycan function based on carbohydrate structure, kernel methods have been developed in the past, in particular to extract potential glycan biomarkers by classifying glycan structures found in different tissue samples. The recently developed weighted q-gram method (LK-method) exhibits good performance on glycan structure classification while having limitations in feature selection. That is, it was unable to extract biologically meaningful features from the data. Therefore, we propose a biochemically-weighted tree kernel (BioLK-method) which is based on a glycan similarity matrix and also incorporates biochemical information of individual q-grams in constructing the kernel matrix. We further applied our new method for the classification and recognition of motifs on publicly available glycan data. Our novel tree kernel (BioLK-method) using a Support Vector Machine (SVM) is capable of detecting biologically important motifs accurately while LK-method failed to do so. It was tested on three glycan data sets from the Consortium for Functional Glycomics (CFG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) GLYCAN and showed that the results are consistent with the literature. The newly developed BioLK-method also maintains comparable classification performance with the LK-method. Our results obtained here indicate that the incorporation of biochemical information of q-grams further shows the flexibility and capability of the novel kernel in feature extraction, which may aid in the prediction of glycan biomarkers.
Affiliation(s)
- Hao Jiang
  - Advanced Modeling and Applied Computing Laboratory, Department of Mathematics, University of Hong Kong, Pokfulam Road, Hong Kong
- Wai-Ki Ching
  - Advanced Modeling and Applied Computing Laboratory, Department of Mathematics, University of Hong Kong, Pokfulam Road, Hong Kong
19
Xuan P, Zhang Y, Tzeng TRJ, Wan XF, Luo F. A quantitative structure-activity relationship (QSAR) study on glycan array data to determine the specificities of glycan-binding proteins. Glycobiology 2011; 22:552-60. [PMID: 22156918 DOI: 10.1093/glycob/cwr163] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 12/15/2022] Open
Abstract
Advances in glycan array technology have provided opportunities to automatically and systematically characterize the binding specificities of glycan-binding proteins. However, there is still a lack of robust methods for such analyses. In this study, we developed a novel quantitative structure-activity relationship (QSAR) method to analyze glycan array data. We first decomposed glycan chains into mono-, di-, tri- or tetrasaccharide subtrees. The bond information was incorporated into subtrees to help distinguish glycan chain structures. Then, we performed partial least-squares (PLS) regression on glycan array data using the subtrees as features. The application of QSAR to the glycan array data of different glycan-binding proteins demonstrated that PLS regression using subtree features can obtain higher R² values and a higher percentage of variance explained in glycan array intensities. Based on the regression coefficients of PLS, we were able to effectively identify subtrees that indicate the binding specificities of a glycan-binding protein. Our approach will facilitate the glycan-binding specificity analysis using the glycan array. A user-friendly web tool of the QSAR method is available at http://bci.clemson.edu/tools/glycan_array.
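The subtree decomposition described in this abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' code: the toy glycan, the dictionary encoding, the label format, and the function names (`rooted_subtrees`, `label`, `subtree_features`) are all assumptions. The sketch enumerates every connected subtree of up to four residues and writes each one as a string that includes the linkage on every internal bond.

```python
from collections import Counter
from itertools import product

# Hypothetical toy glycan: node id -> (residue, linkage to parent, child ids).
GLYCAN = {
    0: ("GlcNAc", None, [1]),
    1: ("Gal", "b1-4", [2, 3]),
    2: ("Neu5Ac", "a2-3", []),
    3: ("Fuc", "a1-2", []),
}


def rooted_subtrees(glycan, node, max_size):
    """Yield node sets of connected subtrees whose topmost residue is `node`."""
    if max_size < 1:
        return
    children = glycan[node][2]
    # Each child is either left out (empty set) or contributes one of its own subtrees.
    options = [
        [frozenset()] + list(rooted_subtrees(glycan, child, max_size - 1))
        for child in children
    ]
    for combo in product(*options):
        nodes = frozenset({node}).union(*combo)
        if len(nodes) <= max_size:
            yield nodes


def label(glycan, nodes, top):
    """Write a subtree as a string; children are sorted so branch order is ignored."""
    residue, _, children = glycan[top]
    parts = sorted(
        f"{glycan[child][1]}:{label(glycan, nodes, child)}"
        for child in children
        if child in nodes
    )
    return residue + ("(" + ",".join(parts) + ")" if parts else "")


def subtree_features(glycan, max_size=4):
    """Count mono- to tetrasaccharide subtrees (for max_size=4) with bond labels."""
    counts = Counter()
    for node in glycan:
        for nodes in rooted_subtrees(glycan, node, max_size):
            counts[label(glycan, nodes, node)] += 1
    return counts


for name, count in sorted(subtree_features(GLYCAN).items()):
    print(count, name)
```

Stacking such counts over all glycans on an array gives a feature matrix that a regression method such as PLS (for example `sklearn.cross_decomposition.PLSRegression`) could then be fitted to, with the measured intensities as the response.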
Affiliation(s)
- Pengfei Xuan
- School of Computing, Clemson University, Clemson, SC 29634, USA
|
20
|
Fukagawa D, Tamura T, Takasu A, Tomita E, Akutsu T. A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures. BMC Bioinformatics 2011; 12 Suppl 1:S13. [PMID: 21342542 PMCID: PMC3044267 DOI: 10.1186/1471-2105-12-s1-s13] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND Measuring similarities between tree structured data is important for analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparison of tree structured data. However, it is known that computation of the edit distance for rooted unordered trees is NP-hard. Furthermore, there is almost no available software tool that can compute the exact edit distance for unordered trees. RESULTS In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem and then efficient solvers for the maximum clique problem are applied. We applied the proposed method to similar structure search for glycan structures. The result suggests that our proposed method can efficiently compute the edit distance for moderate size unordered trees. It also suggests that the proposed method achieves accuracy comparable to that of the edit distance for ordered trees and of an existing method for glycan search. CONCLUSIONS The proposed method is simple but useful for computation of the edit distance between unordered trees. The object code is available upon request.
Affiliation(s)
- Daiji Fukagawa
- Faculty of Culture and Information Science, Doshisha University, Kyoto 610-0394, Japan
- Takeyuki Tamura
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Etsuji Tomita
- University of Electro-Communications, Tokyo 182-8585, Japan
- Research and Development Initiative, Chuo University, Tokyo 112-8551, Japan
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
|
21
|
Frank M, Schloissnig S. Bioinformatics and molecular modeling in glycobiology. Cell Mol Life Sci 2010; 67:2749-72. [PMID: 20364395 PMCID: PMC2912727 DOI: 10.1007/s00018-010-0352-4] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Received: 12/23/2009] [Revised: 03/08/2010] [Accepted: 03/11/2010] [Indexed: 12/11/2022]
Abstract
The field of glycobiology is concerned with the study of the structure, properties, and biological functions of the family of biomolecules called carbohydrates. Bioinformatics for glycobiology is a particularly challenging field, because carbohydrates exhibit a high structural diversity and their chains are often branched. Significant improvements in experimental analytical methods over recent years have led to a tremendous increase in the amount of carbohydrate structure data generated. Consequently, the availability of databases and tools to store, retrieve and analyze these data in an efficient way is of fundamental importance to progress in glycobiology. In this review, the various graphical representations and sequence formats of carbohydrates are introduced, and an overview of newly developed databases, the latest developments in sequence alignment and data mining, and tools to support experimental glycan analysis are presented. Finally, the field of structural glycoinformatics and molecular modeling of carbohydrates, glycoproteins, and protein-carbohydrate interactions is reviewed.
Affiliation(s)
- Martin Frank
- Molecular Structure Analysis Core Facility-W160, Deutsches Krebsforschungszentrum (German Cancer Research Centre), 69120 Heidelberg, Germany.
|
22
|
Cerulo L, Elkan C, Ceccarelli M. Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics 2010; 11:228. [PMID: 20444264 PMCID: PMC2887423 DOI: 10.1186/1471-2105-11-228] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.7] [Received: 10/28/2009] [Accepted: 05/05/2010] [Indexed: 11/16/2022]
Abstract
Background Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact. Results A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement. Conclusions Compared to unsupervised methods for gene network inference, supervised methods are potentially more accurate, but for training they need a complete set of known regulatory connections. A supervised method that can be trained using only positive and unlabeled data, as presented in this paper, is especially beneficial for the task of inferring gene regulatory networks, because only an incomplete set of known regulatory connections is available in public databases such as RegulonDB, TRRD, KEGG, Transfac, and IPA.
Affiliation(s)
- Luigi Cerulo
- Department of Biological and Environmental Studies, University of Sannio, Benevento, Italy.
|
23
|
Li L, Ching WK, Yamaguchi T, Aoki-Kinoshita KF. A weighted q-gram method for glycan structure classification. BMC Bioinformatics 2010; 11 Suppl 1:S33. [PMID: 20122206 PMCID: PMC3009505 DOI: 10.1186/1471-2105-11-s1-s33] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
Background Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or organism. Many computational approaches have been proposed for analyzing these complex glycan structures, which are chains of monosaccharides. The monosaccharides are linked to one another by glycosidic bonds, which can take on a variety of conformations, thus forming branches and resulting in complex tree structures. The q-gram method is one of these recent methods used to understand glycan function based on the classification of their tree structures. This q-gram method assumes that for a certain q, different q-grams share no similarity among themselves. That is, if two structures have completely different components, then they are completely different. However, from a biological standpoint, this is not the case. In this paper, we propose a weighted q-gram method to measure the similarity among glycans by incorporating the similarity of the geometric structures, monosaccharides and glycosidic bonds among q-grams. In contrast to the traditional q-gram method, our weighted q-gram method admits similarity among q-grams for a certain q. Thus our new kernels for glycan structure were developed and then applied in SVMs to classify glycans. Results Two glycan datasets were used to compare the weighted q-gram method and the original q-gram method. The results show that the incorporation of q-gram similarity improves the classification performance for all of the important glycan classes tested. Conclusion The results in this paper indicate that similarity among q-grams obtained from geometric structure, monosaccharides and glycosidic linkage contributes to the glycan function classification. This is a big step towards the understanding of glycan function based on their complex structures.
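The contrast between the original and the weighted q-gram kernels can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: q-grams are taken here to be downward paths of q residues, linkages are ignored, and the similarity function (the fraction of positions with the same monosaccharide) is a hypothetical stand-in for the geometric, monosaccharide and bond similarities the abstract mentions. All names and the two toy trees are invented for the example.

```python
from collections import Counter


def downward_qgrams(tree, q):
    """Count paths of q residues that run from a residue toward the leaves.

    A tree is a (monosaccharide, [child trees]) pair.
    """
    counts = Counter()

    def paths_from(node, length):
        name, kids = node
        if length == 1:
            return [(name,)]
        return [
            (name,) + rest
            for kid in kids
            for rest in paths_from(kid, length - 1)
        ]

    def visit(node):
        counts.update(paths_from(node, q))
        for kid in node[1]:
            visit(kid)

    visit(tree)
    return counts


def exact_match(g, h):
    """Original q-gram assumption: different q-grams share no similarity."""
    return 1.0 if g == h else 0.0


def positional_similarity(g, h):
    """Hypothetical similarity: fraction of positions with the same residue."""
    return sum(a == b for a, b in zip(g, h)) / len(g)


def qgram_kernel(x, y, q, similarity):
    """Sum the similarity over all pairs of q-gram occurrences in x and y."""
    cx, cy = downward_qgrams(x, q), downward_qgrams(y, q)
    return sum(cx[g] * cy[h] * similarity(g, h) for g in cx for h in cy)


# Two toy glycans (assumed for illustration only).
A = ("GlcNAc", [("Man", [("Man", []), ("Man", [])])])
B = ("GlcNAc", [("Man", [("Gal", [])])])

plain = qgram_kernel(A, B, 2, exact_match)
weighted = qgram_kernel(A, B, 2, positional_similarity)
print(plain, weighted)
```

With the exact-match similarity only the shared GlcNAc-Man 2-gram contributes, whereas the weighted variant also credits partially matching 2-grams such as Man-Man against Man-Gal, which is the qualitative behaviour the abstract describes.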
Affiliation(s)
- Limin Li
- Advanced Modeling and Applied Computing Laboratory, Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong.
|
25
|
Aydın B, Pataki G, Wang H, Bullitt E, Marron JS. A principal component analysis for trees. Ann Appl Stat 2009. [DOI: 10.1214/09-aoas263] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Indexed: 11/19/2022]
|
26
|
Ozen A, Gönen M, Alpaydan E, Haliloğlu T. Machine learning integration for predicting the effect of single amino acid substitutions on protein stability. BMC Struct Biol 2009; 9:66. [PMID: 19840377 PMCID: PMC2777163 DOI: 10.1186/1472-6807-9-66] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/03/2009] [Accepted: 10/19/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND Computational prediction of protein stability change due to single-site amino acid substitutions is of interest in protein design and analysis. We consider the following four ways to improve the performance of the currently available predictors: (1) We include additional sequence- and structure-based features, namely, the amino acid substitution likelihoods, the equilibrium fluctuations of the alpha- and beta-carbon atoms, and the packing density. (2) By implementing different machine learning integration approaches, we combine information from different features or representations. (3) We compare classification vs. regression methods to predict the sign vs. the output of stability change. (4) We allow a reject option for doubtful cases where the risk of misclassification is high. RESULTS We investigate three different approaches: early, intermediate and late integration, which respectively combine features, kernels over feature subsets, and decisions. We perform simulations on two data sets: (1) S1615 is used in previous studies, (2) S2783 is the updated version (as of July 2, 2009) extracted also from ProTherm. For S1615 data set, our highest accuracy using both sequence and structure information is 0.842 on cross-validation and 0.904 on testing using early integration. Newly added features, namely, local compositional packing and the mobility extent of the mutated residues, improve accuracy significantly with intermediate integration. For S2783 data set, we also train regression methods to estimate not only the sign but also the amount of stability change and apply risk-based classification to reject when the learner has low confidence and the loss of misclassification is high. The highest accuracy is 0.835 on cross-validation and 0.832 on testing using only sequence information. The percentage of false positives can be decreased to less than 0.005 by rejecting 10 per cent using late integration. 
CONCLUSION We find that in both early and late integration, combining inputs or decisions is useful in increasing accuracy. Intermediate integration allows assessing the contributions of individual features by looking at the assigned weights. Overall accuracy of regression is not better than that of classification but it has fewer false positives, especially when combined with the reject option. The server for stability prediction for three integration approaches and the data sets are available at http://www.prc.boun.edu.tr/appserv/prc/mlsta.
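Late integration with a reject option, as mentioned in this abstract, can be sketched in a few lines. This is an illustrative simplification rather than the authors' procedure: decisions are combined by plain averaging of predicted probabilities, and rejection uses a fixed confidence threshold instead of the risk-based rule the paper describes. The model outputs, the threshold value, and the function names are all assumed.

```python
def late_integration(per_model_probs):
    """Combine decisions by averaging each sample's predicted probabilities."""
    return [sum(probs) / len(probs) for probs in zip(*per_model_probs)]


def classify_with_reject(prob_positive, threshold=0.8):
    """Return 1 or 0, or None when the confidence falls below the threshold."""
    confidence = max(prob_positive, 1.0 - prob_positive)
    if confidence < threshold:
        return None  # doubtful case: decline to predict
    return 1 if prob_positive >= 0.5 else 0


# Hypothetical probabilities of a stabilizing substitution from three
# classifiers, each trained on a different feature representation.
model_outputs = [
    [0.90, 0.55, 0.10],
    [0.95, 0.40, 0.20],
    [0.85, 0.60, 0.05],
]

combined = late_integration(model_outputs)
decisions = [classify_with_reject(p) for p in combined]
print(decisions)
```

Raising the threshold rejects more samples and leaves fewer, more confident predictions, which is the trade-off between coverage and false positives that the abstract reports.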
Affiliation(s)
- Ayşegül Ozen
- Department of Chemical Engineering, Polymer Research Center, Boğaziçi University, Istanbul, Turkey.
|
27
|
Hashimoto K, Takigawa I, Shiga M, Kanehisa M, Mamitsuka H. Mining significant tree patterns in carbohydrate sugar chains. Bioinformatics 2008; 24:i167-73. [PMID: 18689820 DOI: 10.1093/bioinformatics/btn293] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 01/03/2023]
Abstract
MOTIVATION Carbohydrate sugar chains or glycans, the third major class of macromolecules, hold branch shaped tree structures. Glycan motifs are known to be of two types: (1) conserved patterns called 'cores' containing the root and (2) ubiquitous motifs which appear in external parts including leaves and are distributed over different glycan classes. Finding these glycan tree motifs is an important issue, but there have been no computational methods to capture these motifs efficiently. RESULTS We have developed an efficient method for mining motifs or significant subtrees from glycans. The key contribution of this method is: (1) to have proposed a new concept, 'α-closed frequent subtrees', and an efficient method for mining all these subtrees from given trees and (2) to have proposed to apply statistical hypothesis testing to rerank the frequent subtrees in significance. We experimentally verified the effectiveness of the proposed method using real glycans: (1) We examined the top 10 subtrees obtained by our method at some parameter setting and confirmed that all subtrees are significant motifs in glycobiology. (2) We applied the results of our method to a classification problem and found that our method outperformed other competing methods, SVM with three different tree kernels, being all statistically significant. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kosuke Hashimoto
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan
|
28
|
|