1
|
Banerjee A, Roy K. The multiclass ARKA framework for developing improved q-RASAR models for environmental toxicity endpoints. ENVIRONMENTAL SCIENCE. PROCESSES & IMPACTS 2025. [PMID: 40227888 DOI: 10.1039/d5em00068h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2025]
Abstract
The continuous quest for quick, accurate, and efficient methods for filling the gaps in the toxicity data of commercial chemicals is the need of the hour. Thus, it has become essential to develop simple and improved modeling strategies that aim to generate more accurate predictions. Recently, quantitative Read-Across Structure-Activity Relationship (q-RASAR) modeling has been reported to enhance the external predictivity of QSAR models. However, the cross-validation metrics of some q-RASAR models show compromised values compared to those of the corresponding QSAR models. We report here an improved q-RASAR workflow coupled with the Arithmetic Residuals in K-groups Analysis (ARKA) framework. This improved workflow (ARKA-RASAR) considers two important aspects: the contribution of different QSAR descriptors to different experimental response ranges, and the identification of similarity among close congeners based on both the selected QSAR descriptors and the contribution of different QSAR descriptors to different experimental response ranges. A simple, free, and user-friendly Java-based tool, Multiclass ARKA-v1.0, has been developed to compute the multiclass ARKA descriptors. In this study, five different toxicity datasets previously used for the development of QSAR and q-RASAR models were considered. We developed hybrid ARKA models that consist of a combination of QSAR descriptors and ARKA descriptors. These hybrid feature spaces were used to compute RASAR descriptors and develop ARKA-RASAR models. For a fair comparison, we used the same modeling strategies that were used to develop the previously reported QSAR and q-RASAR models. Additionally, these modeling algorithms are straightforward, reproducible, and transferable. A multi-criteria decision-making statistical approach, the Sum of Ranking Differences (SRD), indicated that the ARKA-RASAR models are the best-performing models, considering training, test, and cross-validation statistics. The least significant difference procedure ensured that the SRD values were significantly different for most models, presenting an unbiased workflow. True external validation using a set of pesticide metabolites and predicting their early-stage acute fish toxicity using relevant ARKA-RASAR models was also carried out and yielded encouraging results. The promising results and the ease of computation of ARKA and RASAR descriptors using our tools suggest that the ARKA-RASAR modeling framework may be a potential choice for developing highly robust and predictive models for filling the gaps in environmental toxicity data.
Affiliation(s)
- Arkaprava Banerjee
- Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India.
- Kunal Roy
- Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India.
|
2
|
Du M, Ren Y, Zhang Y, Li W, Yang H, Chu H, Zhao Y. CSEL-BGC: A Bioinformatics Framework Integrating Machine Learning for Defining the Biosynthetic Evolutionary Landscape of Uncharacterized Antibacterial Natural Products. Interdiscip Sci 2025; 17:27-41. [PMID: 39348072 DOI: 10.1007/s12539-024-00656-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 08/26/2024] [Accepted: 08/28/2024] [Indexed: 10/01/2024]
Abstract
The sluggish pace of new antibacterial drug development reflects a vulnerability in the face of the current severe threat posed by bacterial resistance. Microbial natural products (NPs), as a reservoir of immense chemical potential, have emerged as the most promising avenue for the discovery of next-generation antibacterial agents. Directly accessing the antibacterial activity of potential products derived from biosynthetic gene clusters (BGCs) would significantly expedite the process. To tackle this issue, we propose a CSEL-BGC framework that integrates machine learning (ML) techniques. This framework involves the development of a novel cascade-stacking ensemble learning (CSEL) model and the establishment of a groundbreaking model evaluation system. Based on this framework, we predict 6,666 BGCs with antibacterial activity from 3,468 complete bacterial genomes and elucidate a biosynthetic evolutionary landscape to reveal their antibacterial potential. This provides crucial insights for interpreting the synthesis and secretion mechanisms of unknown NPs.
Affiliation(s)
- Minghui Du
- School of Life Science and Bio-Pharmaceutics, Shenyang Pharmaceutical University, Shenyang, 110016, China
- Yuxiang Ren
- School of Life Science and Bio-Pharmaceutics, Shenyang Pharmaceutical University, Shenyang, 110016, China
- Yang Zhang
- School of Life Science and Bio-Pharmaceutics, Shenyang Pharmaceutical University, Shenyang, 110016, China
- Wenwen Li
- School of Life Science and Bio-Pharmaceutics, Shenyang Pharmaceutical University, Shenyang, 110016, China
- Hongtao Yang
- School of Life Science and Bio-Pharmaceutics, Shenyang Pharmaceutical University, Shenyang, 110016, China
- Huiying Chu
- State Key Laboratory of Molecular Reaction Dynamics, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116000, China
- Yongshan Zhao
- School of Life Science and Bio-Pharmaceutics, Shenyang Pharmaceutical University, Shenyang, 110016, China.
|
3
|
Hossain MM, Roy K. The development of classification-based machine-learning models for the toxicity assessment of chemicals associated with plastic packaging. JOURNAL OF HAZARDOUS MATERIALS 2025; 484:136702. [PMID: 39637787 DOI: 10.1016/j.jhazmat.2024.136702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2024] [Revised: 11/24/2024] [Accepted: 11/26/2024] [Indexed: 12/07/2024]
Abstract
Assessing chemical toxicity in materials like plastic packaging is critical to safeguarding public health. This study presents the development of classification-based machine learning models to predict the toxicity of chemicals associated with plastic packaging. Using an extensive dataset of chemical structures, we trained multiple machine learning models (Random Forest, Support Vector Machine, Linear Discriminant Analysis, and Logistic Regression) targeting endpoints such as Neurotoxicity, Hepatotoxicity, Dermatotoxicity, Carcinogenicity, Reproductive Toxicity, Skin Sensitization, and Toxic Pneumonitis. The dataset was pre-processed by selecting 2D molecular descriptors as feature inputs, with resampling methods (ADASYN, Borderline SMOTE, Random Over-sampler, SVMSMOTE, Cluster Centroid, Near Miss, Random Under Sampler) applied to balance classes for accurate classification. A five-fold cross-validation technique was used to optimize model performance, with model parameters fine-tuned using grid search. The model performance was evaluated using accuracy (Acc), sensitivity (Se), specificity (Sp), and area under the receiver operating characteristic curve (AUC-ROC) metrics. In most of the cases, the model accuracy was 0.8 or above for both training and test sets. Additionally, SHAP (SHapley Additive exPlanations) values were utilized for feature importance analysis, highlighting significant descriptors contributing to toxicity predictions. The models were ranked using the Sum of Ranking Differences (SRD) method to systematically select the most effective model. The optimal models demonstrated high predictive accuracy and interpretability, providing a scalable and efficient solution for toxicity assessment compared to traditional methods. This approach offers a valuable tool for rapidly screening potentially hazardous chemicals in plastic packaging.
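The accuracy, sensitivity, and specificity named in this abstract follow directly from a binary confusion matrix. The sketch below is only an illustration of those standard definitions, not the authors' code; the function name `binary_metrics` and the toy labels are hypothetical, and class 1 is assumed to mean "toxic".

```python
def binary_metrics(y_true, y_pred):
    """Return Acc, Se, Sp for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Undefined ratios (empty class) are reported as NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "Acc": ratio(tp + tn, len(pairs)),
        "Se": ratio(tp, tp + fn),  # sensitivity: recall on the positive class
        "Sp": ratio(tn, tn + fp),  # specificity: recall on the negative class
    }


# Hypothetical toy labels, for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

Reporting Se and Sp alongside Acc matters for imbalanced endpoints, since accuracy alone can look high when one class dominates.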
Affiliation(s)
- Md Mobarak Hossain
- Drug Theoretics and Cheminformatics (DTC) Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India
- Kunal Roy
- Drug Theoretics and Cheminformatics (DTC) Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India.
|
4
|
Andrić F, Imamoto M, Jankov M. Implementation of multiobjective decision-making algorithms and image analysis in HPTLC-guided extraction optimization of natural products. J Chromatogr A 2024; 1737:465443. [PMID: 39490194 DOI: 10.1016/j.chroma.2024.465443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Revised: 10/13/2024] [Accepted: 10/15/2024] [Indexed: 11/05/2024]
Abstract
A new, efficient, and low-cost approach for monitoring extraction optimization was proposed based on high-performance thin-layer chromatography (HPTLC) coupled with digital image analysis. Since HPTLC produces rich chromatographic signals corresponding to various analytes which may be differently affected by extraction conditions, four multicriteria decision-making (MCDM) techniques were compared for their ability to aggregate multiple chromatographic responses: Derringer's desirability approach, Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), Preference Ranking Organization Method for Enrichment Evaluations (PROMETHEE-2), and the Sum of ranking differences (SRD). Ultrasound-assisted extraction (UAE) of green tea leaves with ethanol-water mixtures was used as a model system. The amount of ethanol and extraction time were varied according to the central composite design. Ranking eleven extracts by Derringer's desirability approach, TOPSIS, and PROMETHEE-2 showed the same results. SRD analysis yielded slightly different results from previous methods. Response surface models (RSM) based on the previous three MCDM approaches demonstrated that extraction conditions with moderate amounts of ethanol (73%) and extraction times (46 min) lead to optimal chromatographic profiles. RSM optimization performed on individual peaks, tentatively corresponding to rutin, chlorophyll, and gallic acid, led to different results, which justified the use of MCDM algorithms for aggregation of multiple responses. Aside from natural products, the proposed approach has the potential to be implemented in various extraction optimizations.
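As an illustration of how one of the MCDM techniques named above aggregates several responses into a single score, here is a minimal sketch of textbook TOPSIS. It is not the authors' implementation: the function name `topsis`, the equal weights, the vector normalization, and the toy response matrix are all assumptions made for the example.

```python
import math


def topsis(matrix, weights, benefit):
    """Closeness-to-ideal score per alternative (row); higher is better.

    matrix  : rows = alternatives (e.g. extracts), columns = criteria (e.g. peak areas)
    weights : one weight per criterion
    benefit : True if larger is better for that criterion, False if smaller is better
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical responses for three extracts on two "larger is better" criteria.
toy_scores = topsis([[3.0, 4.0], [1.0, 2.0], [2.0, 3.0]], [0.5, 0.5], [True, True])
```

The sketch assumes no all-zero criterion column and at least two distinct alternatives; otherwise the normalization or the final ratio would divide by zero.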
Affiliation(s)
- Filip Andrić
- University of Belgrade - Faculty of Chemistry, Studentski trg 12-16, 1100 Belgrade, Serbia.
- Minami Imamoto
- Department of Life Science and Technology, School of Life Science and Technology, Tokyo Institute of Technology, Tokyo, Japan
- Milica Jankov
- Innovative Centre of the Faculty of Chemistry Ltd., Studentski trg 12-16, 1100 Belgrade, Serbia
|
5
|
Multiobject Optimization of National Football League Drafts: Comparison of Teams and Experts. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12136303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/10/2022]
Abstract
Predicting the success of National Football League drafts has always been an exciting issue for the teams, fans and even for scientists. Among the numerous approaches, one of the best techniques is to ask the opinion of sport experts, who have the knowledge and past experiences to rate the drafts of the teams. When asking a set of sport experts to evaluate the performances of teams, a multicriteria decision-making problem arises unavoidably. The current paper uses the draft evaluations of the 32 NFL teams given by 18 experts and applies a novel multicriteria decision-making tool, the sum of ranking differences (SRD). We introduce a quick and easy-to-follow approach on how to evaluate the performance of the teams and the experts at the same time. Our results on the 2021 NFL draft data indicate that the Green Bay Packers have the most promising drafts for 2021, while the experts have been grouped into three distinct groups based on the distance to the hypothetical best evaluation. Even the coding options can be tailored according to the experts’ opinions. Statistically correct (pairwise or group) comparisons can be made using analysis of variance (ANOVA). A comparison to TOPSIS ranking revealed that SRD gives a more objective ranking due to the lack of predefined weights.
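The core SRD computation can be sketched in a few lines: rank the objects (rows) within each column, rank them by a reference column, and sum the absolute rank differences per column. The sketch below assumes the row average as the reference, which is a common consensus choice but not necessarily the one used in this paper; the names `ranks` and `srd` and the toy scores are hypothetical, and the validation steps that usually accompany SRD (such as comparison with random rankings) are omitted.

```python
def ranks(values):
    """Tie-aware (average) ranks; the smallest value gets rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def srd(table, reference=None):
    """Sum of absolute rank differences between each column and the reference.

    table     : dict mapping column name (e.g. expert) to scores for the same objects
    reference : optional reference scores; defaults to the row-wise mean (assumption)
    """
    names = list(table)
    n_rows = len(table[names[0]])
    if reference is None:
        reference = [sum(table[c][i] for c in names) / len(names) for i in range(n_rows)]
    ref_ranks = ranks(reference)
    return {c: sum(abs(a - b) for a, b in zip(ranks(table[c]), ref_ranks)) for c in names}


# Hypothetical scores given by three experts to four teams.
toy_table = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]}
toy_srd = srd(toy_table)
```

A smaller SRD value means the column orders the objects more like the reference does; an SRD of zero means an identical ordering.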
|
6
|
Extended continuous similarity indices: theory and application for QSAR descriptor selection. J Comput Aided Mol Des 2022; 36:157-173. [PMID: 35288838 DOI: 10.1007/s10822-022-00444-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/11/2021] [Accepted: 02/23/2022] [Indexed: 01/10/2023]
Abstract
Extended (or n-ary) similarity indices have been recently proposed to extend the comparative analysis of binary strings. Going beyond the traditional notion of pairwise comparisons, these novel indices allow comparing any number of objects at the same time. This results in a remarkable efficiency gain with respect to other approaches, since now we can compare N molecules in O(N) instead of the common quadratic O(N²) timescale. This favorable scaling has motivated the application of these indices to diversity selection, clustering, phylogenetic analysis, chemical space visualization, and post-processing of molecular dynamics simulations. However, the current formulation of the n-ary indices is limited to vectors with binary or categorical inputs. Here, we present the further generalization of this formalism so it can be applied to numerical data, i.e. to vectors with continuous components. We discuss several ways to achieve this extension and present their analytical properties. As a practical example, we apply this formalism to the problem of feature selection in QSAR and prove that the extended continuous similarity indices provide a convenient way to discern between several sets of descriptors.
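The linear-versus-quadratic scaling mentioned above can be illustrated with a simple counting identity. The sketch below is not the extended similarity index from the paper; it only shows, under that stated simplification, how a quantity normally obtained from all pairwise comparisons (the total number of shared on-bits) can instead be read off the column sums in a single pass over N binary fingerprints. The function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def shared_on_bits_pairwise(fingerprints):
    """Total shared on-bits over all pairs: O(N^2) comparisons."""
    return sum(
        sum(a & b for a, b in zip(x, y))
        for x, y in combinations(fingerprints, 2)
    )


def shared_on_bits_columnwise(fingerprints):
    """Same total from column sums: one pass over the N fingerprints.

    A column in which s fingerprints have the bit set contributes
    s * (s - 1) / 2 pairs that share that bit.
    """
    column_sums = [sum(column) for column in zip(*fingerprints)]
    return sum(s * (s - 1) // 2 for s in column_sums)


# Hypothetical 4-bit fingerprints for three molecules.
toy_fps = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1]]
toy_total = shared_on_bits_columnwise(toy_fps)
```

Both functions return the same total, but only the second avoids enumerating pairs, which is the kind of saving that working on column-wise aggregates makes possible.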
|
7
|
Prioritizing Post-Disaster Reconstruction Projects Using an Integrated Multi-Criteria Decision-Making Approach: A Case Study. BUILDINGS 2022. [DOI: 10.3390/buildings12020136] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/16/2022]
Abstract
As the destructive impacts of both human-made and natural disasters on societies and built environments are predicted to increase in the future, innovative disaster management strategies to cope with emergency conditions are becoming more crucial. After a disaster, selecting the most critical post-disaster reconstruction projects among available projects is a challenging decision due to resource constraints. There is strong evidence that the success of many post-disaster reconstruction projects is compromised by inappropriate decisions when choosing the most critical projects. Therefore, this study presents an integrated approach based on four multi-criteria decision-making (MCDM) techniques, namely, TOPSIS, ELECTRE III, VIKOR, and PROMETHEE, to aid decision makers in prioritizing post-disaster projects. Furthermore, an aggregation approach (linear assignment) is used to generate the final ranking vector since various methods may provide different outcomes. In the first stage, 21 criteria were determined based on sustainability. To validate the performance of the proposed approach, the obtained results were compared to the results of an artificial neural network (ANN) algorithm, which was applied to predict the projects’ success rates. A case study was used to assess the application of the proposed model. The obtained results show that in the selected case, the most critical criteria in post-disaster project selection are quality, robustness, and customer satisfaction. The findings of this study can contribute to the growing body of knowledge about disaster management strategies and have implications for key stakeholders involved in post-disaster reconstruction projects. Furthermore, this study provides valuable information for national decision makers in countries that have limited experience with disasters and where the destructive consequences of disasters on the built environment are increasing.
|
8
|
Karwowska M, Kononiuk AD. Effect of nitrate reduction and storage time on the antioxidative properties, biogenic amines and amino acid profile of dry fermented loins. Int J Food Sci Technol 2021. [DOI: 10.1111/ijfs.15373] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Affiliation(s)
- Małgorzata Karwowska
- Department of Animal Raw Materials Technology, Faculty of Food Science and Biotechnology, University of Life Sciences in Lublin, Skromna 8, Lublin 20-704, Poland
- Anna D. Kononiuk
- Department of Animal Raw Materials Technology, Faculty of Food Science and Biotechnology, University of Life Sciences in Lublin, Skromna 8, Lublin 20-704, Poland
- Institute of Animal Reproduction and Food Research, Polish Academy of Sciences, ul. Tuwima 10, Olsztyn 10-748, Poland
|
9
|
Radványi D, Szelényi M, Gere A, Molnár BP. From Sampling to Analysis: How to Achieve the Best Sample Throughput via Sampling Optimization and Relevant Compound Analysis Using Sum of Ranking Differences Method? Foods 2021; 10:foods10112681. [PMID: 34828965 PMCID: PMC8624423 DOI: 10.3390/foods10112681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Revised: 10/25/2021] [Accepted: 10/26/2021] [Indexed: 12/05/2022] Open
Abstract
The determination of an optimal volatile sampling procedure is always a key question in analytical chemistry. In this paper, we introduce the application of a novel non-parametric statistical method, the sum of ranking differences (SRD), for the quick and efficient determination of optimal sampling procedures. Different types of adsorbents (Porapak Q, HayeSep Q, and Carbotrap) and sampling times (1, 2, 4, and 6 h) were used for volatile collections of lettuce (Lactuca sativa) samples. SRD identified 6 h samplings as the optimal procedure. However, 1 or 4 h sampling with HayeSep Q and 2 h sampling with Carbotrap are still efficient enough if the aim is to reduce sampling time. Based on our results, SRD provides a novel way to not only highlight an optimal sampling procedure but also decrease evaluation time.
Affiliation(s)
- Dalma Radványi
- Institute of Food Science and Technology, Hungarian University of Agriculture and Life Sciences, Villányi út 29-43, H-1118 Budapest, Hungary
- Magdolna Szelényi
- Plant Protection Institute, Eötvös Loránd Research Network, Brunszvik u. 2, H-2462 Martonvásár, Hungary
- Attila Gere
- Institute of Food Science and Technology, Hungarian University of Agriculture and Life Sciences, Villányi út 29-43, H-1118 Budapest, Hungary
- Béla Péter Molnár
- Plant Protection Institute, Eötvös Loránd Research Network, Brunszvik u. 2, H-2462 Martonvásár, Hungary
|
10
|
Bajusz D, Miranda-Quintana RA, Rácz A, Héberger K. Extended many-item similarity indices for sets of nucleotide and protein sequences. Comput Struct Biotechnol J 2021; 19:3628-3639. [PMID: 34257841 PMCID: PMC8253954 DOI: 10.1016/j.csbj.2021.06.021] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/11/2021] [Revised: 06/07/2021] [Accepted: 06/14/2021] [Indexed: 12/16/2022] Open
Abstract
Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.
Affiliation(s)
- Dávid Bajusz
- Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Anita Rácz
- Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Károly Héberger
- Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
|
11
|
Chemometrics for Selection, Prediction, and Classification of Sustainable Solutions for Green Chemistry—A Review. Symmetry (Basel) 2020. [DOI: 10.3390/sym12122055] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/16/2022] Open
Abstract
In this review, we present the applications of chemometric techniques for green and sustainable chemistry. The techniques, such as cluster analysis, principal component analysis, artificial neural networks, and multivariate ranking techniques, are applied for dealing with missing data, grouping or classification purposes, selection of green material, or processes. The areas of application are mainly finding sustainable solutions in terms of solvents, reagents, processes, or conditions of processes. Another important area is filling the data gaps in datasets to more fully characterize sustainable options. It is significant as many experiments are avoided, and the results are obtained with good approximation. Multivariate statistics are tools that support the application of quantitative structure–property relationships, a widely applied technique in green chemistry.
|