1
|
Ma R, Liu Z, Zhang Q, Liu Z, Luo T. Evaluating Polymer Representations via Quantifying Structure-Property Relationships. J Chem Inf Model 2019; 59:3110-3119. [PMID: 31268306 DOI: 10.1021/acs.jcim.9b00358] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 12/15/2022]
Abstract
Machine learning techniques are being applied to quantify structure-property relationships for a wide variety of materials, where proper representation of the materials plays a key role. Although algorithms for representation learning are extensively studied, their applications to domain-specific areas, such as polymers, are limited, largely due to the lack of benchmark databases. In this work, we investigate different types of polymer representations, including Morgan fingerprint (MF), molecular embedding (ME), and molecular graph (MG), based on a benchmark database drawn from a subset of the well-known web-based polymer database PolyInfo. We evaluate the quality of the different polymer representations by quantifying the relationships between the representations and polymer properties, including density, melting temperature, and glass transition temperature. Different representation learning schemes for MEs, such as supervised learning (SL), semisupervised learning (SSL), and transfer learning (TL), are investigated. In supervised learning, only labeled molecules in our benchmark database are used for representation learning; in semisupervised learning, both labeled and unlabeled molecules in our benchmark database are used; and in transfer learning, molecules from an external database that is different from the benchmark database are used. It is found that ME (with an R² of 0.724 in the density case, 0.684 in the melting temperature case, and 0.865 in the glass transition temperature case) outperforms the other representations for structure-property relationship quantification in all cases studied, and MG (with an R² of 0.260 in the density case, -0.149 in the melting temperature case, and 0.711 in the glass transition temperature case) is shown to be much inferior to ME and MF (with an R² of 0.562 in the density case, 0.645 in the melting temperature case, and 0.849 in the glass transition temperature case), likely due to the relatively small volumes of training data available. For MEs, it is found that the similarities of substructure MEs are estimated differently under different learning schemes (e.g., SL, SSL, and TL), thus leading to different performance scores in structure-property relationship quantification. Combinations of MEs show little effect on predictive performance compared with the single MEs in the corresponding regression tasks, showing no information gain from mixing MEs.
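The negative score reported for MG in the melting temperature case is possible because the coefficient of determination is not bounded below by zero. The following is a minimal sketch, assuming the conventional definition R² = 1 - SS_res / SS_tot (the abstract does not state which formula the authors used); the function name `r_squared` and the toy values are hypothetical and are not data from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values for illustration only (not from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
close_pred = [1.1, 1.9, 3.2, 3.8]   # predictions near the targets
poor_pred = [4.0, 3.0, 2.0, 1.0]    # predictions worse than the mean

print(r_squared(y_true, close_pred))  # close to 1
print(r_squared(y_true, poor_pred))   # negative
```

Under this definition, a model scores below zero whenever its squared error exceeds that of always predicting the mean of the targets.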
Affiliation(s)
- Ruimin Ma
- Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Zeyu Liu
- Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Quanwei Zhang
- Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Zhiyu Liu
- Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Tengfei Luo
- Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States; Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
|
2
|
Application of genomics, proteomics and metabolomics in drug discovery, development and clinic. Ther Deliv 2013; 4:395-413. [PMID: 23442083 DOI: 10.4155/tde.13.4] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Indexed: 12/14/2022] Open
Abstract
Genomics, proteomics and metabolomics are three areas that are routinely applied throughout the drug-development process as well as after a product enters the market. This review discusses all three 'omics, reporting on the key applications, techniques, recent advances and expectations of each. Genomics, mainly through the use of novel and next-generation sequencing techniques, has advanced areas of drug discovery and development through the comparative assessment of normal and diseased-state tissues, transcription and/or expression profiling, side-effect profiling, pharmacogenomics and the identification of biomarkers. Proteomics, through techniques including isotope-coded affinity tags, stable isotopic labeling by amino acids in cell culture, isobaric tags for relative and absolute quantification, multidimensional protein identification technology, activity-based probes, protein/peptide arrays, phage displays and two-hybrid systems, is utilized in multiple areas of the drug-development pipeline, including target and lead identification, compound optimization, the clinical trials process and post-market analysis. Metabolomics, although the most recent and least developed of the three 'omics considered in this review, provides a significant contribution to drug development through systems biology approaches. Metabolomics is already implemented to some degree in the drug-discovery industry, in applications spanning target identification through to toxicological analysis, and an understanding of metabolic networks is essential in generating future discoveries.
|
3
|
Ghosh A, Chattopadhyay S, Chawla-Sarkar M, Nandy P, Nandy A. In silico study of rotavirus VP7 surface accessible conserved regions for antiviral drug/vaccine design. PLoS One 2012; 7:e40749. [PMID: 22844409 PMCID: PMC3406019 DOI: 10.1371/journal.pone.0040749] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 03/26/2012] [Accepted: 06/12/2012] [Indexed: 11/23/2022] Open
Abstract
Background: Rotaviral diarrhea kills about half a million children annually in developing countries and accounts for one third of diarrhea-related hospitalizations. Drugs and vaccines against the rotavirus are handicapped, as in all viral diseases, by the rapid mutational changes that take place in the DNA and protein sequences, rendering most of them ineffective. As of now, only two vaccines are licensed and approved by the WHO (World Health Organization), but they display reduced efficacy in the underdeveloped countries where the disease is more prevalent. We approached this issue by trying to identify surface-exposed conserved segments on the surface glycoproteins of the virion, which may then be targeted by specific peptide vaccines. We had developed a bioinformatics protocol for these kinds of problems with reference to the influenza neuraminidase protein, which we have refined and expanded to analyze the rotavirus issue. Results: Our analysis of 433 VP7 (Viral Protein 7 from rotavirus) surface protein sequences across 17 subtypes encompassing mammalian hosts, using a 20D Graphical Representation and Numerical Characterization method, identified four possible highly conserved peptide segments. Solvent accessibility prediction servers were used to identify that these are predominantly surface situated. These regions, analyzed through selected epitope prediction servers for their epitopic properties towards possible T-cell and B-cell activation, showed good results as epitopic candidates (dry-lab confirmation only). Conclusions: The main reasons for the development of alternative vaccine strategies for the rotavirus are the failure of current vaccines and high production costs that inhibit their application in developing countries. We expect that it would be possible to use the surface-exposed protein regions identified in our study as targets for peptide vaccines and drug designs for stable immunity against divergent strains of the rotavirus. Though this study is fully dependent on computational prediction algorithms, it provides a platform for wet-lab experiments.
Affiliation(s)
- Ambarnil Ghosh
- Physics Department, Jadavpur University, Kolkata, West Bengal, India
- Shiladitya Chattopadhyay
- Division of Virology, National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India
- Mamta Chawla-Sarkar
- Division of Virology, National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India
- Papiya Nandy
- Physics Department, Jadavpur University, Kolkata, West Bengal, India
- Ashesh Nandy
- Centre for Interdisciplinary Research and Education, Kolkata, West Bengal, India
|
4
|
Aguiar-Pulido V, Munteanu CR, Seoane JA, Fernández-Blanco E, Pérez-Montoto LG, González-Díaz H, Dorado J. Naïve Bayes QSDR classification based on spiral-graph Shannon entropies for protein biomarkers in human colon cancer. Mol Biosyst 2012; 8:1716-22. [PMID: 22466084 DOI: 10.1039/c2mb25039j] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
Abstract
Fast cancer diagnosis represents a real necessity in applied medicine due to the importance of this disease. Thus, theoretical models can help as prediction tools. Graph theory representation is one option because it permits us to numerically describe any real system, such as protein macromolecules, by transforming real properties into molecular graph topological indices. This study proposes a new classification model for proteins linked with human colon cancer by using spiral graph topological indices of protein amino acid sequences. The best quantitative structure-disease relationship model is based on eleven Shannon entropy indices. It was obtained with the Naïve Bayes method and shows excellent predictive ability (90.92%) for new proteins linked with this type of cancer. The statistical analysis confirms that this model allows diagnosing the absence of human colon cancer, with an area under the receiver operating characteristic curve of 0.91. The methodology presented can be used for any type of sequential information, such as any protein or nucleic acid sequence.
Affiliation(s)
- Vanessa Aguiar-Pulido
- Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain
|
5
|
Agüero-Chapin G, de la Riva GA, Molina-Ruiz R, Sánchez-Rodríguez A, Pérez-Machado G, Vasconcelos V, Antunes A. Non-linear models based on simple topological indices to identify RNase III protein members. J Theor Biol 2010; 273:167-78. [PMID: 21192951 DOI: 10.1016/j.jtbi.2010.12.019] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/02/2010] [Revised: 11/15/2010] [Accepted: 12/13/2010] [Indexed: 01/27/2023]
Abstract
Alignment-free classifiers are especially useful in the functional classification of protein classes with variable homology and different domain structures. Thus, the Topological Indices to BioPolymers (TI2BioP) methodology (Agüero-Chapin et al., 2010), inspired by both the TOPS-MODE and the MARCH-INSIDE methodologies, allows the calculation of simple topological indices (TIs) as alignment-free classifiers. These indices were derived from the clustering of the amino acids into four classes of hydrophobicity and polarity, revealing higher sequence-order information beyond the amino acid composition level. The predictive power of such TIs was evaluated for the first time on the RNase III family, due to the high diversity of its members (primary sequence and domain organization). Three non-linear models were developed for RNase III class prediction: a Decision Tree Model (DTM), an Artificial Neural Network (ANN) model and a Hidden Markov Model (HMM). The first two are alignment-free approaches, using TIs as input predictors. Their performances were compared with a non-classical HMM, modified according to our amino acid clustering strategy. The alignment-free models showed similar performances on the training and the test sets, reaching values above 90% in the overall classification. The non-classical HMM showed the highest classification rate, with values above 95% in training and 100% in test. Despite the higher accuracy of the HMM, the DTM offered simplicity for the RNase III classification with low computational cost. Such simplicity was evaluated with respect to the HMM and ANN models for the functional annotation of a new bacterial RNase III class member, isolated and annotated by our group.
Affiliation(s)
- Guillermin Agüero-Chapin
- CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, 4050-123 Porto, Portugal
|
6
|
Wu ZC, Xiao X, Chou KC. 2D-MH: A web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids. J Theor Biol 2010; 267:29-34. [DOI: 10.1016/j.jtbi.2010.08.007] [Citation(s) in RCA: 104] [Impact Index Per Article: 7.4] [Received: 04/30/2010] [Revised: 08/04/2010] [Accepted: 08/04/2010] [Indexed: 11/15/2022]
|
7
|
Semmar N. A New Mixture Design-Based Approach to Graphical Screening of Potential Interconnections and Variability Processes in Metabolic Systems. Chem Biol Drug Des 2010; 75:91-105. [DOI: 10.1111/j.1747-0285.2009.00912.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
|
8
|
Pérez-Montoto LG, Santana L, González-Díaz H. Scoring function for DNA-drug docking of anticancer and antiparasitic compounds based on spectral moments of 2D lattice graphs for molecular dynamics trajectories. Eur J Med Chem 2009; 44:4461-9. [PMID: 19604606 PMCID: PMC7127518 DOI: 10.1016/j.ejmech.2009.06.011] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 05/08/2009] [Revised: 06/04/2009] [Accepted: 06/05/2009] [Indexed: 02/02/2023]
Abstract
We introduce here a new class of invariants for Molecular Dynamics (MD) trajectories based on the spectral moments π_k(L) of the Markov matrix associated with lattice network-like (LN) graph representations of the trajectories. The procedure embeds the MD energy profiles in a 2D Cartesian coordinate system using simple heuristic rules. At the same time, we associate the LN with a Markov matrix that describes the probabilities of passing from one state to another in the new 2D space. We construct this type of LN for 422 MD trajectories obtained in DNA-drug docking experiments of 57 furocoumarins. The combined use of psoralens and ultraviolet light (UVA) radiation is known as PUVA therapy. PUVA is effective in the treatment of skin diseases such as psoriasis and mycosis fungoides. PUVA is also useful to treat human platelet (PTL) concentrates in order to eliminate Leishmania spp. and Trypanosoma cruzi. These parasites cause leishmaniosis (a dangerous skin and visceral disease) and Chagas disease, respectively, and may circulate in blood products collected from infected donors. We included in this study both linear (psoralens) and angular (angelicins) furocoumarins. In the study, we grouped the LNs into two sets; set1: DNA-drug complex MD trajectories for active compounds, and set2: MD trajectories of non-active compounds or non-optimal MD trajectories of active compounds. We calculated the respective π_k(L) values for all these LNs and used them as inputs to train a new classifier that discriminates set1 from set2 cases. In the training series, the model correctly classifies 79 out of 80 (specificity = 98.75%) set1 and 226 out of 238 (sensitivity = 94.96%) set2 trajectories. In the independent validation series, the model correctly classifies 26 out of 26 (specificity = 100%) set1 and 75 out of 78 (sensitivity = 96.15%) set2 trajectories. We propose this new model as a scoring function to guide DNA-docking studies in the drug design of new coumarins for anticancer or antiparasitic PUVA therapy.
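The per-set percentages quoted in this abstract follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; the helper name `percent_correct` and the dictionary layout are hypothetical, and only the counts themselves come from the abstract.

```python
def percent_correct(correct, total):
    """Share of correctly classified trajectories, as a percentage (2 decimals)."""
    return round(100.0 * correct / total, 2)


# (correct, total) counts as reported in the abstract.
counts = {
    "training set1": (79, 80),
    "training set2": (226, 238),
    "validation set1": (26, 26),
    "validation set2": (75, 78),
}

for name, (correct, total) in counts.items():
    print(f"{name}: {percent_correct(correct, total)}%")
```

The sketch only checks the per-set ratios; it makes no assumption about how the authors assign the labels "specificity" and "sensitivity" to set1 and set2.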
Affiliation(s)
- Lázaro G. Pérez-Montoto
- Department of Microbiology & Parasitology, and Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain
- Lourdes Santana
- Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain
- Humberto González-Díaz
- Department of Microbiology & Parasitology, and Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain
|
9
|
Generalized lattice graphs for 2D-visualization of biological information. J Theor Biol 2009; 261:136-47. [PMID: 19646452 PMCID: PMC7094121 DOI: 10.1016/j.jtbi.2009.07.029] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 04/30/2009] [Revised: 07/18/2009] [Accepted: 07/20/2009] [Indexed: 01/09/2023]
Abstract
Several graph representations have been introduced for different data in theoretical biology. For instance, complex networks based on Graph theory are used to represent the structure and/or dynamics of different large biological systems such as protein–protein interaction networks. In addition, Randic, Liao, Nandy, Basak, and many others developed some special types of graph-based representations. This special type of graph includes geometrical constraints on node positioning in space and adopts final geometrical shapes that resemble lattice-like patterns. Lattice networks have been used to visually depict DNA and protein sequences, but they are very flexible. However, despite the proven efficacy of new lattice-like graphs/networks to represent diverse systems, most works focus on only one specific type of biological data. This work proposes a generalized type of lattice and illustrates how to use it to represent and compare biological data from different sources. We exemplify the following cases: protein sequences; mass spectra (MS) of protein peptide mass fingerprints (PMF); molecular dynamics trajectories (MDTs) from structural studies; mRNA microarray data; single nucleotide polymorphisms (SNPs); 1D or 2D-Electrophoresis studies of protein polymorphisms; and protein-research patent and/or copyright information. We used data available from public sources for some examples, but for others we used experimental results reported herein for the first time. This work may break new ground for the application of Graph theory in theoretical biology and other areas of biomedical sciences.
|
10
|
Pérez-Montoto LG, Dea-Ayuela MA, Prado-Prado FJ, Bolas-Fernández F, Ubeira FM, González-Díaz H. Study of peptide fingerprints of parasite proteins and drug-DNA interactions with Markov-Mean-Energy invariants of biopolymer molecular-dynamic lattice networks. Polymer 2009; 50:3857-3870. [PMID: 32287404 PMCID: PMC7111648 DOI: 10.1016/j.polymer.2009.05.055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/10/2009] [Revised: 05/06/2009] [Accepted: 05/14/2009] [Indexed: 11/26/2022]
Abstract
Since the advent of Molecular Dynamics (MD) in biopolymer science with the study by Karplus et al. on protein dynamics, MD has become by far the foremost, well-established computational technique to investigate the structure and function of biomolecules and their respective complexes and interactions. The analysis of MD trajectories (MDTs) remains, however, the greatest challenge and requires a great deal of insight, experience, and effort. Here, we introduce a new class of invariants for MDTs based on the spatial distribution of Mean-Energy values ξk(L) on a 2D Euclidean space representation of the MDTs. The procedure forces one MD trajectory to fold into a 2D Cartesian coordinate system using a step-by-step procedure driven by simple rules. The ξk(L) values are invariants of a Markov matrix (¹Π), which describes the probabilities of transition between two states in the new 2D space and is associated with a graph representation of MDTs similar to the lattice networks (LNs) of DNA and protein sequences. We also introduce a new algorithm to perform phylogenetic analysis of peptides based on MDTs instead of the sequence of the polypeptide. In a first experiment, we illustrate this algorithm for 35 peptides present in the Peptide Mass Fingerprint (PMF) of a new protein of Leishmania infantum studied in this work. We report, for the first time, 2D Electrophoresis isolation, MALDI TOF Mass Spectroscopy characterization, and MASCOT search results for this PMF. In a second experiment, we construct the LNs for 422 MDTs obtained in DNA-Drug Docking simulations of the interaction of 57 anticancer furocoumarins with a DNA oligonucleotide. We calculated the respective ξk(L) values for all these LNs and used them as inputs to train a new classifier with Accuracy = 85.44% and 84.91% in training and validation, respectively. The new model can be used as a scoring function to guide DNA-Drug Docking studies in the drug design of new coumarins for PUVA therapy. The new phylogenetic analysis algorithms encode information different from sequence similarity and may be used to analyze MDTs obtained in docking or modeling experiments for any class of biopolymers. The work opens a new perspective on the analysis and applications of MD in polymer sciences.
Affiliation(s)
- Lázaro Guillermo Pérez-Montoto
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain; Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- María Auxiliadora Dea-Ayuela
- Departamento de Atención Sanitaria, Salud Pública y Sanidad Animal, Facultad CC Experimentales y de La Salud, Universidad CEU Cardenal Herrera, 46113 Moncada (Valencia), Spain
- Francisco J. Prado-Prado
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain; Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Florencio M. Ubeira
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Humberto González-Díaz (corresponding author)
- Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
|