1
Zhu CJ, Song M, Liu Q, Becquey C, Bi J. Benchmark on Indexing Algorithms for Accelerating Molecular Similarity Search. J Chem Inf Model 2020; 60:6167-6184. [PMID: 33095006 DOI: 10.1021/acs.jcim.0c00393] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/04/2023]
Abstract
Structurally similar analogues of given query compounds can be rapidly retrieved from chemical databases by molecular similarity search approaches. However, the computational cost of an exhaustive similarity search over a large compound database is high. Although the latest indexing algorithms can greatly speed up the search process, they are not readily applicable to molecular similarity search problems because they lack an implementation of the Tanimoto similarity metric. In this paper, we first implement Python or C++ code to enable Tanimoto similarity search via several recent indexing algorithms, such as Hnsw and Onng. Moreover, there is increasing interest in the computational community in developing robust benchmarking systems to assess the performance of various computational algorithms. Here, we provide a benchmark to evaluate the molecular similarity searching performance of these recent indexing algorithms. To avoid potential package dependency issues, two separate benchmarks are built on currently popular container technologies, Docker and Singularity. Singularity is a relatively new container framework designed specifically for high-performance computing (HPC) platforms; it needs neither privileged permissions nor a separate daemon process. Both benchmarks can be extended to incorporate new indexing algorithms, benchmarking data sets, and customized parameter settings. Our results demonstrate that graph-based methods, such as Hnsw and Onng, consistently achieve the best trade-off between search effectiveness and search efficiency. The source code of the entire benchmark system can be downloaded from https://github.uconn.edu/mldrugdiscovery/MssBenchmark.
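As a point of reference for the exhaustive search that the abstract describes as costly, the following is a minimal Python sketch of a brute-force Tanimoto top-k query. It is not taken from the benchmark's source code: the function names, the representation of a fingerprint as a set of "on" bit positions, and the toy data are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: positions of the "on" bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: |A & B| / |A | B| for two bit sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def exhaustive_top_k(
    query: Fingerprint, database: Sequence[Fingerprint], k: int
) -> List[Tuple[float, int]]:
    """Score every database entry against the query and keep the k best.

    Cost grows linearly with the database size, which is the expense that
    indexing algorithms aim to avoid.
    """
    scored = [(tanimoto(query, fp), idx) for idx, fp in enumerate(database)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return scored[:k]


# Toy data (hypothetical fingerprints, not real molecules).
database = [frozenset({1, 2, 3, 4}), frozenset({2, 3}), frozenset({7, 8, 9})]
query = frozenset({2, 3, 4})
hits = exhaustive_top_k(query, database, k=2)
```

Here `hits` pairs each similarity with the index of the matching database entry; an approximate index would aim to return the same entries while scoring only a fraction of the database.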
2
Vachery J, Ranu S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. J Chem Inf Model 2019; 59:2702-2713. [PMID: 30908028 DOI: 10.1021/acs.jcim.9b00069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
Abstract
The ability to search for a query molecule in massive molecular repositories is a fundamental task in chemoinformatics and drug discovery. Chemical fingerprints are commonly used to characterize the structure and properties of molecules. Some fingerprints, particularly unfolded fingerprints, are often extremely high-dimensional and sparse, with only a few features having a positive value. In this work, we propose a new search algorithm, RISC, which exploits sparsity in high-dimensional fingerprints to derive effective pruning mechanisms and dramatically speed up searching. RISC is robust enough to work on both binary and nonbinary chemical fingerprints. Extensive experiments on Range Queries and Top-k Queries across several molecular repositories demonstrate that at fingerprint dimensions of 2048 and above, which is often the case with unfolded fingerprints, RISC is consistently faster than the state-of-the-art techniques. The source code of our implementation is available at http://www.cse.iitd.ac.in/~sayan/software.html.
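To illustrate why sparsity helps, the following Python sketch runs a Tanimoto range query over a generic inverted index on binary fingerprints. It is not the RISC algorithm, whose pruning rules the abstract does not spell out; the function names, the size-based pruning bound, and the toy data are assumptions chosen for illustration, and the threshold is assumed to be greater than zero.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: indices of nonzero features


def build_inverted_index(fingerprints: Sequence[Fingerprint]) -> Dict[int, List[int]]:
    """Map each feature to the list of molecule ids that contain it."""
    index: Dict[int, List[int]] = defaultdict(list)
    for mol_id, fp in enumerate(fingerprints):
        for feature in fp:
            index[feature].append(mol_id)
    return index


def range_query(
    query: Fingerprint,
    fingerprints: Sequence[Fingerprint],
    index: Dict[int, List[int]],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (molecule id, similarity) pairs with Tanimoto >= threshold (> 0)."""
    # Only posting lists of the query's own features are visited, so
    # molecules sharing no feature with the query are never touched.
    shared: Dict[int, int] = defaultdict(int)
    for feature in query:
        for mol_id in index.get(feature, ()):
            shared[mol_id] += 1

    hits = []
    for mol_id, inter in shared.items():
        size = len(fingerprints[mol_id])
        # Size bound: Tanimoto(A, B) <= min(|A|, |B|) / max(|A|, |B|).
        if min(len(query), size) < threshold * max(len(query), size):
            continue  # cannot reach the threshold, skip the exact computation
        similarity = inter / (len(query) + size - inter)
        if similarity >= threshold:
            hits.append((mol_id, similarity))
    return sorted(hits)


# Toy data (hypothetical fingerprints, not real molecules).
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3}),
    frozenset({7, 8, 9}),
    frozenset({2, 3, 4, 5, 6, 10, 11, 12, 13}),
]
index = build_inverted_index(fingerprints)
matches = range_query(frozenset({2, 3, 4}), fingerprints, index, threshold=0.6)
```

In this toy run, molecule 2 is never visited because it shares no feature with the query, and molecule 3 is discarded by the size bound before its similarity is computed.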
Affiliation(s)
- Jithin Vachery: Department of Computer Science, IIT-Madras, Chennai, 600036, India
- Sayan Ranu: Department of Computer Science, IIT-Delhi, New Delhi, 110016, India
3
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 129] [Impact Index Per Article: 16.1] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger: Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal: Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço: ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal: Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia: Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
4
Accurate and efficient target prediction using a potency-sensitive influence-relevance voter. J Cheminform 2015; 7:63. [PMID: 26719774 PMCID: PMC4696267 DOI: 10.1186/s13321-015-0110-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 08/11/2015] [Accepted: 12/02/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: A number of algorithms have been proposed to predict the biological targets of diverse molecules. Some are structure-based, but the most common are ligand-based and use chemical fingerprints and the notion of chemical similarity. These methods tend to be computationally faster than others, making them particularly attractive tools as the amount of available data grows. RESULTS: Using a ChEMBL-derived database covering 490,760 molecule-protein interactions and 3236 protein targets, we conduct a large-scale assessment of the performance of several target-prediction algorithms at predicting drug-target activity. We assess algorithm performance using three validation procedures: standard tenfold cross-validation, tenfold cross-validation in a simulated screen that includes random inactive molecules, and validation on an external test set composed of molecules not present in our database. CONCLUSIONS: We present two improvements over current practice. First, using a modified version of the influence-relevance voter (IRV), we show that using molecule potency data can improve target prediction. Second, we demonstrate that random inactive molecules added during training can boost the accuracy of several algorithms in realistic target-prediction experiments. Our potency-sensitive version of the IRV (PS-IRV) obtains the best results on large test sets in most of the experiments. Models and software are publicly accessible through the chemoinformatics portal at http://chemdb.ics.uci.edu/.
5
Saeedipour S, Tai D, Fang J. ChemCom: A Software Program for Searching and Comparing Chemical Libraries. J Chem Inf Model 2015; 55:1292-6. [PMID: 26067384 DOI: 10.1021/ci500713s] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
An efficient chemical comparator, a computer application facilitating searching and comparing chemical libraries, is useful in drug discovery and other relevant areas. The need for an efficient and user-friendly chemical comparator prompted us to develop ChemCom (Chemical Comparator) based on Java Web Start (JavaWS) technology. ChemCom provides a user-friendly graphical interface to a number of fast algorithms, including a novel algorithm termed the UnionBit Tree Algorithm. It uses an intuitive stepwise mechanism for selecting chemical comparison parameters before starting the comparison process. UnionBit has shown approximately a 165% speedup on average over its closest competing algorithm implemented in ChemCom on real data. It is approximately 11 times faster than the Open Babel FastSearch algorithm in our tests. ChemCom can be accessed free of charge via a user-friendly website at http://bioinformatics.org/chemcom/.
Affiliation(s)
- Sirus Saeedipour: Applied Bioinformatics Laboratory, The University of Kansas, 2034 Becker Drive, Lawrence, Kansas 66047, United States
- David Tai: Applied Bioinformatics Laboratory, The University of Kansas, 2034 Becker Drive, Lawrence, Kansas 66047, United States
- Jianwen Fang: Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, 9609 Medical Center Dr., Rockville, Maryland 20850, United States; Applied Bioinformatics Laboratory, The University of Kansas, 2034 Becker Drive, Lawrence, Kansas 66047, United States
6
Thiel P, Sach-Peltason L, Ottmann C, Kohlbacher O. Blocked Inverted Indices for Exact Clustering of Large Chemical Spaces. J Chem Inf Model 2014; 54:2395-401. [DOI: 10.1021/ci500150t] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022]
Affiliation(s)
- Philipp Thiel: Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center and Dept. of Computer Science, University of Tübingen, Sand 14, 72076 Tübingen, Germany
- Lisa Sach-Peltason: Pharma Research & Early Development Informatics, Data Science, F. Hoffmann-La Roche AG, Grenzacherstr. 124, CH-4070 Basel, Switzerland
- Christian Ottmann: Laboratory of Chemical Biology and Institute of Complex Molecular Systems, Department of Biomedical Engineering, Technische Universiteit Eindhoven, Den Dolech 2, 5612 AZ Eindhoven, The Netherlands
- Oliver Kohlbacher: Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center and Dept. of Computer Science, University of Tübingen, Sand 14, 72076 Tübingen, Germany
7
Fingerprint design and engineering strategies: rationalizing and improving similarity search performance. Future Med Chem 2013; 4:1945-59. [PMID: 23088275 DOI: 10.4155/fmc.12.126] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 01/05/2023] Open
Abstract
Fingerprints (FPs) are bit or integer string representations of molecular structure and properties, and are popular descriptors for chemical similarity searching. A major goal of similarity searching is the identification of novel active compounds on the basis of known reference molecules. In this review, recent FP design and engineering strategies are discussed. New types of FPs continue to be introduced, often applying different design principles. FP engineering techniques have recently been introduced to further improve search performance and computational efficiency and to elucidate the mechanisms by which FPs recognize active compounds. In addition, through feature selection and hybridization techniques, standard FPs have been transformed into compound class-specific versions with further increased search performance. Moreover, scaffold hopping mechanisms have been explored. FPs will continue to play an important role in the search for novel active compounds.
8
Kristensen TG, Nielsen J, Pedersen CNS. Methods for Similarity-based Virtual Screening. Comput Struct Biotechnol J 2013; 5:e201302009. [PMID: 24688702 PMCID: PMC3962175 DOI: 10.5936/csbj.201302009] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 10/24/2012] [Revised: 01/30/2013] [Accepted: 02/08/2013] [Indexed: 11/22/2022] Open
Abstract
Developing new medical drugs is expensive. Among the first steps is a screening process, in which molecules in existing chemical libraries are tested for activity against a given target. This requires a lot of resources and manpower. Therefore, it has become common to perform a virtual screening, where computers are used to predict the activity of very large libraries of molecules in order to identify the most promising leads for further laboratory experiments. Since computer simulations generally require fewer resources than physical experimentation, this can lower the cost of medical and biological research significantly. In this paper, we review algorithms that are fast in practice for screening databases of molecules in order to find molecules that are sufficiently similar to a query molecule.
Affiliation(s)
- Thomas G Kristensen: Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark (now employed by Trifork GmbH)
- Jesper Nielsen: Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark (now employed by Google Inc)
- Christian N S Pedersen: Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark
9
Ruddigkeit L, Blum LC, Reymond JL. Visualization and virtual screening of the chemical universe database GDB-17. J Chem Inf Model 2013; 53:56-65. [PMID: 23259841 DOI: 10.1021/ci300535x] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.5] [Indexed: 01/20/2023]
Abstract
The chemical universe database GDB-17 contains 166.4 billion molecules of up to 17 atoms of C, N, O, S, and halogens obeying rules for chemical stability, synthetic feasibility, and medicinal chemistry. GDB-17 was analyzed using 42 integer value descriptors of molecular structure which we term "Molecular Quantum Numbers" (MQN). Principal component analysis and representation of the (PC1, PC2)-plane provided a graphical overview of the GDB-17 chemical space. Rapid ligand-based virtual screening (LBVS) of GDB-17 using the city-block distance CBD(MQN) as a similarity search measure was enabled by a hashed MQN-fingerprint. LBVS of the entire GDB-17 and of selected subsets identified shape similar, scaffold hopping analogs (ROCS > 1.6 and T(SF) < 0.5) of 15 drugs. Over 97% of these analogs occurred within CBD(MQN) ≤ 12 from each drug, a constraint which might help focus advanced virtual screening. An MQN-searchable 50 million subset of GDB-17 is publicly available at www.gdb.unibe.ch .
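The city-block distance used as the similarity measure here is the sum of absolute differences between two descriptor vectors. The following Python sketch shows a distance filter with the CBD ≤ 12 cutoff mentioned in the abstract. It is an illustration only: the function names are hypothetical, the toy vectors are placeholders rather than real MQN values, and the hashed MQN-fingerprint that made the search fast on GDB-17 is not reproduced.

```python
from typing import List, Sequence


def city_block_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """City-block (Manhattan) distance between two equal-length integer vectors."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def within_cbd(
    query: Sequence[int], library: Sequence[Sequence[int]], max_distance: int = 12
) -> List[int]:
    """Return indices of library vectors within max_distance of the query."""
    return [
        idx
        for idx, vector in enumerate(library)
        if city_block_distance(query, vector) <= max_distance
    ]


# Toy 42-dimensional vectors (placeholders, not real MQN descriptors).
query_vector = [0] * 42
library = [
    [1] * 12 + [0] * 30,  # distance 12 from the query
    [1] * 13 + [0] * 29,  # distance 13 from the query
    [0] * 42,             # distance 0 from the query
]
neighbours = within_cbd(query_vector, library)
```

With these placeholder vectors, the first and third library entries fall within the cutoff and the second does not.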
Affiliation(s)
- Lars Ruddigkeit: Department of Chemistry and Biochemistry, University of Berne, Freiestrasse 3, 3012 Berne, Switzerland