1
|
Queen KJ, Barrett M, Millstein J. Super Partition: fast, flexible, and interpretable large-scale data reduction in R. PeerJ 2025; 13:e18580. [PMID: 39886016 PMCID: PMC11781262 DOI: 10.7717/peerj.18580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 11/04/2024] [Indexed: 02/01/2025] Open
Abstract
Motivation: As data sets increase in size and complexity with advancing technology, flexible and interpretable data reduction methods that quantify information preservation become increasingly important. Results: Super Partition is a large-scale approximation of the original Partition data reduction algorithm that allows the user to flexibly specify the minimum amount of information captured for each input feature. In an initial step, Genie, a fast, hierarchical clustering algorithm, forms a super-partition, thereby increasing the computational tractability by allowing Partition to be applied to the subsets. Applications to high dimensional data sets show scalability to hundreds of thousands of features with reasonable computation times. Availability and implementation: Super Partition is a new function within the partition R package, available on the CRAN repository (https://cran.r-project.org/web/packages/partition/index.html).
Affiliation(s)
- Katelyn J. Queen
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, United States
- Malcolm Barrett
  - Department of Health Policy, Stanford University, Stanford, California, United States
- Joshua Millstein
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, United States
|
2
|
Du H, Lu D, Wang Z, Ma C, Shi X, Wang X. Fast clustering algorithm based on MST of representative points. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:15830-15858. [PMID: 37919991 DOI: 10.3934/mbe.2023705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2023]
Abstract
Minimum spanning tree (MST)-based clustering algorithms are widely used to detect clusters with diverse densities and irregular shapes. However, most algorithms require the entire dataset to construct an MST, which leads to significant computational overhead. To alleviate this issue, our proposed algorithm R-MST utilizes representative points instead of all sample points for constructing MST. Additionally, based on the density and nearest neighbor distance, we improved the representative point selection strategy to enhance the uniform distribution of representative points in sparse areas, enabling the algorithm to perform well on datasets with varying densities. Furthermore, traditional methods for eliminating inconsistent edges generally require prior knowledge about the number of clusters, which is not always readily available in practical applications. Therefore, we propose an adaptive method that employs mutual neighbors to identify inconsistent edges and determine the optimal number of clusters automatically. The experimental results indicate that the R-MST algorithm not only improves the efficiency of clustering but also enhances its accuracy.
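The classical MST clustering that this abstract builds on can be sketched as follows. This is a minimal illustration, not the R-MST algorithm itself: it builds the MST over all points (the overhead R-MST avoids by using representative points) and it assumes the number of clusters `k` is known (the prior knowledge R-MST's adaptive edge removal is meant to avoid). The function names `mst_edges` and `mst_clusters` are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the MST as a list of (weight, i, j) tuples.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection cost to the tree
    parent = [-1] * n       # tree vertex that provides that connection
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the connected components."""
    n = len(points)
    edges = sorted(mst_edges(points))
    kept = edges[: max(n - k, 0)]  # an MST has n-1 edges; keep the n-k shortest

    root = list(range(n))  # union-find forest over the kept edges

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    return [find(i) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = mst_clusters(points, k=2)
```

In this toy data the single long edge joining the two groups is the one that gets cut, so the first three points share one label and the last three share another. Prim's algorithm on the complete graph costs O(n²) distance evaluations, which illustrates why building the MST on the full dataset becomes expensive.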
Affiliation(s)
- Hui Du
  - The School of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
- Depeng Lu
  - The School of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
- Zhihe Wang
  - The School of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
- Cuntao Ma
  - The School of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
- Xinxin Shi
  - The School of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
- Xiaoli Wang
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
|
3
|
Kowalski PA, Jeczmionek E. Parallel complete gradient clustering algorithm and its properties. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.03.087] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
|
4
|
Chiu AM, Molloy EK, Tan Z, Talwalkar A, Sankararaman S. Inferring population structure in biobank-scale genomic data. Am J Hum Genet 2022; 109:727-737. [PMID: 35298920 PMCID: PMC9069078 DOI: 10.1016/j.ajhg.2022.02.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/08/2021] [Accepted: 02/21/2022] [Indexed: 01/07/2023] Open
Abstract
Inferring the structure of human populations from genetic variation data is a key task in population and medical genomic studies. Although a number of methods for population structure inference have been proposed, current methods are impractical to run on biobank-scale genomic datasets containing millions of individuals and genetic variants. We introduce SCOPE, a method for population structure inference that is orders of magnitude faster than existing methods while achieving comparable accuracy. SCOPE infers population structure in about a day on a dataset containing one million individuals and variants as well as on the UK Biobank dataset containing 488,363 individuals and 569,346 variants. Furthermore, SCOPE can leverage allele frequencies from previous studies to improve the interpretability of population structure estimates.
Affiliation(s)
- Alec M Chiu
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Erin K Molloy
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA; Institute for Advanced Computer Studies, University of Maryland, College Park, College Park, MD 20742, USA
- Zilong Tan
  - Facebook, Inc., Menlo Park, CA 94025, USA
- Ameet Talwalkar
  - Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Sriram Sankararaman
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA.
|
5
|
Anchang B, Mendez-Giraldez R, Xu X, Archer TK, Chen Q, Hu G, Plevritis SK, Motsinger-Reif AA, Li JL. Visualization, benchmarking and characterization of nested single-cell heterogeneity as dynamic forest mixtures. Brief Bioinform 2022; 23:6534382. [PMID: 35192692 PMCID: PMC8921621 DOI: 10.1093/bib/bbac017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/02/2021] [Revised: 11/19/2021] [Accepted: 01/13/2022] [Indexed: 11/13/2022] Open
Abstract
A major topic of debate in developmental biology centers on whether development is continuous, discontinuous, or a mixture of both. Pseudo-time trajectory models, optimal for visualizing cellular progression, model cell transitions as continuous state manifolds and do not explicitly model real-time, complex, heterogeneous systems and are challenging for benchmarking with temporal models. We present a data-driven framework that addresses these limitations with temporal single-cell data collected at discrete time points as inputs and a mixture of dependent minimum spanning trees (MSTs) as outputs, denoted as dynamic spanning forest mixtures (DSFMix). DSFMix uses decision-tree models to select genes that account for variations in multimodality, skewness and time. The genes are subsequently used to build the forest using tree agglomerative hierarchical clustering and dynamic branch cutting. We first motivate the use of forest-based algorithms compared to single-tree approaches for visualizing and characterizing developmental processes. We next benchmark DSFMix to pseudo-time and temporal approaches in terms of feature selection, time correlation, and network similarity. Finally, we demonstrate how DSFMix can be used to visualize, compare and characterize complex relationships during biological processes such as epithelial-mesenchymal transition, spermatogenesis, stem cell pluripotency, early transcriptional response from hormones and immune response to coronavirus disease. Our results indicate that the expression of genes during normal development exhibits a high proportion of non-uniformly distributed profiles that are mostly right-skewed and multimodal; the latter being a characteristic of major steady states during development. Our study also identifies and validates gene signatures driving complex dynamic processes during somatic or germline differentiation.
Affiliation(s)
- Benedict Anchang
  - Corresponding author: Benedict Anchang, Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences. 111 T W Alexander Dr, Research Triangle Park, NC 27709, USA and Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, USA. Tel +1 984-287-3350; E-mail:
- Raul Mendez-Giraldez
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Stanford, California, USA
- Xiaojiang Xu
  - Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, Stanford, California, USA
- Trevor K Archer
  - Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA
- Qing Chen
  - Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA
- Guang Hu
  - Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA
- Sylvia K Plevritis
  - Department of Biomedical Data Science, Center for Cancer Systems Biology, Stanford University, Stanford, California, USA
- Alison Anne Motsinger-Reif
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Stanford, California, USA
- Jian-Liang Li
  - Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, Stanford, California, USA
|
6
|
Gagolewski M, Bartoszuk M, Cena A. Are cluster validity measures (in) valid? Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
7
|
Mehta V, Bawa S, Singh J. WEClustering: word embeddings based text clustering technique for large datasets. COMPLEX INTELL SYST 2021; 7:3211-3224. [PMID: 34777978 PMCID: PMC8421191 DOI: 10.1007/s40747-021-00512-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/05/2021] [Accepted: 08/14/2021] [Indexed: 11/24/2022]
Abstract
A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, and books. Text clustering is a fundamental data mining technique to perform categorization, topic extraction, and information retrieval. Textual datasets, especially those that contain a large number of documents, are sparse and have high dimensionality. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN cannot perform well. In this paper, a clustering technique especially suitable for large text datasets is proposed that overcomes these limitations. The proposed technique, named WEClustering, is based on word embeddings derived from a recent deep learning model named "Bidirectional Encoder Representations from Transformers" (BERT). It deals with the problem of high dimensionality in an effective manner; hence, more accurate clusters are formed. The technique is validated on several datasets of varying sizes and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed clustering technique gives a significant improvement over other techniques as measured by metrics such as Purity and Adjusted Rand Index.
Affiliation(s)
- Vivek Mehta
  - Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab 147001 India
- Seema Bawa
  - Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab 147001 India
- Jasmeet Singh
  - Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab 147001 India
|
8
|
Yan S, Zhang M, Lai S, Liu Y, Peng Y. Image retrieval for Structure-from-Motion via Graph Convolutional Network. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.05.050] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/21/2022]
|
9
|
Chang CY, Lee SJ, Wu CH, Liu CF, Liu CK. Using word semantic concepts for plagiarism detection in text documents. INFORM RETRIEVAL J 2021. [DOI: 10.1007/s10791-021-09394-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/25/2022]
|
10
|
A Hybrid Model to Classify Patients with Chronic Obstructive Respiratory Diseases. J Med Syst 2021; 45:31. [PMID: 33517504 PMCID: PMC7847234 DOI: 10.1007/s10916-020-01704-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2020] [Accepted: 12/27/2020] [Indexed: 11/05/2022]
Abstract
Over the last decades, an increase in the ageing population and age-related diseases has been observed, along with an increase in healthcare costs. New solutions are therefore needed to provide more efficient and affordable support to this group of patients. Such solutions should never discard the user and instead should focus on promoting healthier lifestyles and providing tools for patients' active participation in the treatment and management of their diseases. In this regard, the Personal Health Empowerment (PHE) project presented in this paper aims to empower patients to monitor and improve their health, using personal data and technology-assisted coaching. The work described in this paper focuses on defining a hybrid user modelling approach that identifies different groups of users among patients with chronic obstructive respiratory diseases. A classification model with 90.4% prediction accuracy was generated by combining agglomerative hierarchical clustering and decision tree classification techniques. Furthermore, this model identified 5 clusters which describe characteristics of 5 different types of users according to 7 generated rules. With the modelling approach defined in this study, a personalized coaching solution will be built that considers patients with different necessities and capabilities and adapts the support provided, enabling the recognition of early signs of exacerbations and objective self-monitoring and treatment of the disease. The novelty of this approach resides in the possibility of integrating personalized coaching technologies adapted to each kind of user within a smartphone-based application, resulting in a reliable and affordable alternative for patients to manage their disease.
|
11
|
|
12
|
Weighted z-Distance-Based Clustering and Its Application to Time-Series Data. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9245469] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
Clustering is the practice of dividing given data into similar groups and is one of the most widely used methods for unsupervised learning. Lee and Ouyang proposed a self-constructing clustering (SCC) method in which the similarity threshold, instead of the number of clusters, is specified in advance by the user. For a given set of instances, SCC performs only one training cycle on those instances. Once an instance has been assigned to a cluster, the assignment will not be changed afterwards. The clusters produced may depend on the order in which the instances are considered, and assignment errors are more likely to occur. Also, all dimensions are equally weighted, which may not be suitable in certain applications, e.g., time-series clustering. In this paper, improvements are proposed. Two or more training cycles on the instances are performed. An instance can be re-assigned to another cluster in each cycle. In this way, the clusters produced are less likely to be affected by the feeding order of the instances. Also, each dimension of the input can be weighted differently in the clustering process. The values of the weights are adaptively learned from the data. A number of experiments with real-world benchmark datasets are conducted and the results are shown to demonstrate the effectiveness of the proposed ideas.
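The order dependence of a single training cycle, as described in this abstract, can be illustrated with a minimal sketch. It is not the SCC method or the proposed weighted z-distance: as a simplifying assumption it uses a plain Euclidean distance threshold in place of SCC's similarity test, and the function name `one_pass_clusters` is hypothetical.

```python
import math


def one_pass_clusters(instances, threshold):
    """Single-cycle threshold clustering (a simplified stand-in for SCC).

    Each instance joins the nearest existing cluster if its distance to that
    cluster's center is within `threshold`; otherwise it starts a new cluster.
    Assignments are never revisited, so the result depends on feeding order.
    """
    centers, counts, labels = [], [], []
    for x in instances:
        best, best_d = None, math.inf
        for c, center in enumerate(centers):
            d = math.dist(x, center)
            if d < best_d:
                best, best_d = c, d
        if best is None or best_d > threshold:
            centers.append(list(x))
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            counts[best] += 1
            n = counts[best]
            # Incremental mean update of the cluster center.
            centers[best] = [m + (xi - m) / n for m, xi in zip(centers[best], x)]
            labels.append(best)
    return labels, centers


# Same three 1-D instances, two feeding orders.
labels_a, centers_a = one_pass_clusters([(0.0,), (2.0,), (4.0,)], threshold=2.5)
labels_b, centers_b = one_pass_clusters([(2.0,), (4.0,), (0.0,)], threshold=2.5)
```

In the first order the instance at 2 is grouped with 0 (centers 1.0 and 4.0); in the second it is grouped with 4 (centers 3.0 and 0.0). Running additional cycles in which instances may be re-assigned, as the abstract proposes, is one way to reduce this sensitivity.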
|
13
|
Yin D, Motohashi K, Dang J. Large-scale name disambiguation of Chinese patent inventors (1985–2016). Scientometrics 2019. [DOI: 10.1007/s11192-019-03310-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
|
14
|
Bougnom BP, McNally A, Etoa FX, Piddock LJ. Antibiotic resistance genes are abundant and diverse in raw sewage used for urban agriculture in Africa and associated with urban population density. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2019; 251:146-154. [PMID: 31078086 DOI: 10.1016/j.envpol.2019.04.056] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/16/2019] [Revised: 04/08/2019] [Accepted: 04/10/2019] [Indexed: 06/09/2023]
Abstract
A comparative study was conducted to (1) assess the potential of raw sewage used for urban agriculture to disseminate bacterial resistance in two cities of different size in Cameroon (Central Africa) and (2) compare the outcome with data obtained in Burkina Faso (West Africa). In each city, raw sewage samples were collected from open-air canals in three neighbourhoods. After DNA extraction, the microbial population structure and function, presence of pathogens, antibiotic resistance genes and Enterobacteriaceae plasmid replicons were analysed using whole genome shotgun sequencing and bioinformatics. Forty-three pathogen-specific virulence factor genes were detected in the sewage. Eighteen different incompatibility groups of Enterobacteriaceae plasmid replicon types (ColE, A/C, B/O/K/Z, FIA, FIB, FIC, FII, H, I, N, P, Q, R, T, U, W, X, and Y) implicated in the spread of drug-resistance genes were present in the sewage samples. One hundred thirty-six antibiotic resistance genes commonly associated with MDR plasmid carriage were identified in both cities. Enterobacteriaceae plasmid replicons and ARGs found in Burkina Faso wastewaters were also present in Cameroon waters. The abundance of Enterobacteriaceae, plasmid replicons and antibiotic resistance genes was greater in Yaounde, the city with the greater population. In conclusion, the clinically relevant environmental resistome found in raw sewage used for urban agriculture is common in West and Central Africa. The size of the city affects the abundance of drug-resistance genes in the raw sewage, while ESBL gene abundance is related to the prevalence of Enterobacteriaceae along with plasmid Enterobacteriaceae abundance associated with faecal pollution.
Affiliation(s)
- Blaise P Bougnom
  - Institute of Microbiology and Infection, University of Birmingham, B15 2TT, UK; Department of Microbiology, Faculty of Science, University of Yaounde 1, P.O. Box, 812, Yaounde, Cameroon
- Alan McNally
  - Institute of Microbiology and Infection, University of Birmingham, B15 2TT, UK
- François-X Etoa
  - Department of Microbiology, Faculty of Science, University of Yaounde 1, P.O. Box, 812, Yaounde, Cameroon
- Laura Jv Piddock
  - Institute of Microbiology and Infection, University of Birmingham, B15 2TT, UK.
|
15
|
Xu G, Yang M, Wu Q. Sparse subspace clustering with low-rank transformation. Neural Comput Appl 2019. [DOI: 10.1007/s00521-017-3259-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
16
|
Tonkin-Hill G, Lees JA, Bentley SD, Frost SDW, Corander J. Fast hierarchical Bayesian analysis of population structure. Nucleic Acids Res 2019; 47:5539-5549. [PMID: 31076776 PMCID: PMC6582336 DOI: 10.1093/nar/gkz361] [Citation(s) in RCA: 158] [Impact Index Per Article: 26.3] [Received: 03/25/2019] [Accepted: 04/29/2019] [Indexed: 12/16/2022] Open
Abstract
We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach is able to cluster datasets 10-100 times larger than the existing model-based methods, which we demonstrate by analyzing an alignment of over 110 000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy in order to maximize the DPM model marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides comparable or improved solutions to previous model-based methods, while being significantly faster. The method is made freely available under an open source MIT licence as an easy to use R package at https://github.com/gtonkinhill/fastbaps.
Affiliation(s)
- Gerry Tonkin-Hill
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- John A Lees
  - Department of Microbiology, New York University School of Medicine, NY 10016, USA
- Stephen D Bentley
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Simon D W Frost
  - Department of Veterinary Medicine, University of Cambridge, Cambridge, CB3 0ES, UK
  - The Alan Turing Institute, London, NW1 2DB, UK
- Jukka Corander
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
  - Department of Biostatistics, University of Oslo, Blindern 0317, Norway
  - Helsinki Institute for Information Technology HIIT, Department of Mathematics and Statistics, University of Helsinki, Aalto FI-00076, Finland
|
17
|
Goyal P, Kumari S, Sharma S, Balasubramaniam S, Goyal N. Parallel SLINK for big data. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2019. [DOI: 10.1007/s41060-019-00188-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
|
18
|
Ciza PH, Sacre PY, Waffo C, Coïc L, Avohou H, Mbinze JK, Ngono R, Marini RD, Hubert P, Ziemons E. Comparing the qualitative performances of handheld NIR and Raman spectrophotometers for the detection of falsified pharmaceutical products. Talanta 2019; 202:469-478. [PMID: 31171209 DOI: 10.1016/j.talanta.2019.04.049] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 10/16/2018] [Revised: 04/15/2019] [Accepted: 04/19/2019] [Indexed: 12/16/2022]
Abstract
Over the last decade, the growth of the global pharmaceutical market has led to an overall increase of substandard and falsified drugs, especially on the African market (or emerging countries). Recently, several methods using handheld/portable vibrational spectroscopy have been developed for rapid and on-field drug analysis. The objective of this work was to evaluate the performances of various NIR and Raman handheld spectrophotometers in specific brand identification of medicines through their primary packaging. Three groups of drug samples (artemether-lumefantrine, paracetamol and ibuprofen) were used in tablet or capsule forms. In order to perform a critical comparison, the analytical performances of the two analytical systems were compared statistically using three methods: hierarchical clustering algorithm (HCA), data-driven soft independent modelling of class analogy (DD-SIMCA) and hit quality index (HQI). The overall results show good detection abilities for NIR systems compared to Raman systems based on Matthews's correlation coefficients, generally close to one. Although Raman systems are less sensitive to the physical state of the samples than NIR systems, they suffer from auto-fluorescence, and the signal of a highly dosed active pharmaceutical ingredient (API; e.g. paracetamol or lumefantrine) may mask the signal of low-dosed and weaker Raman-active compounds (e.g. artemether). Hence, Raman systems are less effective for specific product identification purposes but are interesting in the context of falsification because they allow a visual interpretation of the spectral signature (presence or absence of API).
Affiliation(s)
- P H Ciza
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium; University of Kinshasa, Faculty of Pharmaceutical Sciences, LACOMEDA, Lemba, 212 Kinshasa XI, Democratic Republic of Congo
- P-Y Sacre
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium.
- C Waffo
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium; University of Yaoundé I, Faculty of Medicine and Biomedical Sciences and National Drug Control and Valuation (LANACOME), Cameroon
- L Coïc
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- H Avohou
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- J K Mbinze
  - University of Kinshasa, Faculty of Pharmaceutical Sciences, LACOMEDA, Lemba, 212 Kinshasa XI, Democratic Republic of Congo
- R Ngono
  - University of Yaoundé I, Faculty of Medicine and Biomedical Sciences and National Drug Control and Valuation (LANACOME), Cameroon
- R D Marini
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- Ph Hubert
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- E Ziemons
  - University of Liege (ULiege), CIRM, VibraSante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
|
19
|
Loslever P, Guidini Gonçalves T, de Oliveira KM, Kolski C. Using fuzzy coding with qualitative data: example with subjective data in human-computer interaction. THEORETICAL ISSUES IN ERGONOMICS SCIENCE 2019. [DOI: 10.1080/1463922x.2019.1574932] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
Affiliation(s)
- Pierre Loslever
  - LAMIH CNRS UMR 8201, Université Polytechnique Hauts-de-France, Valenciennes, France
- Christophe Kolski
  - LAMIH CNRS UMR 8201, Université Polytechnique Hauts-de-France, Valenciennes, France
|
20
|
|
21
|
|
22
|
Beliakov G, Gagolewski M, James S. Penalty-Based and Other Representations of Economic Inequality. INT J UNCERTAIN FUZZ 2016. [DOI: 10.1142/s0218488516400018] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfare functions can be considered in the framework of OWA operators, here we propose a framework of penalty-based aggregation functions and their associated penalties as measures of inequality.
Affiliation(s)
- Gleb Beliakov
  - School of Information Technology, Deakin University, 221 Burwood Hwy, Burwood, Victoria 3125, Australia
- Marek Gagolewski
  - Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland
  - Faculty of Mathematics and Information Science, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, Poland
- Simon James
  - School of Information Technology, Deakin University, 221 Burwood Hwy, Burwood, Victoria 3125, Australia
|
23
|
Szilágyi L, Szilágyi SM. A modified two-stage Markov clustering algorithm for large and sparse networks. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2016; 135:15-26. [PMID: 27586476 DOI: 10.1016/j.cmpb.2016.07.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/06/2016] [Revised: 05/27/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
BACKGROUND: Graph-based hierarchical clustering algorithms become prohibitively costly in both execution time and storage space as the number of nodes approaches the order of millions. OBJECTIVE: A fast and highly memory-efficient Markov clustering algorithm is proposed to perform the classification of huge sparse networks using an ordinary personal computer. METHODS: Improvements compared to previous versions are achieved through adequately chosen data structures that facilitate the efficient handling of symmetric sparse matrices. Clustering is performed in two stages: the initial connected network is processed in a sparse matrix until it breaks into isolated, small, and relatively dense subgraphs, which are then processed separately until convergence is obtained. An intelligent stopping criterion is also proposed to quit further processing of a subgraph that tends toward completeness with equal edge weights. The main advantage of this algorithm is that the necessary number of iterations is separately decided for each graph node. RESULTS: The proposed algorithm was tested using the SCOP95 and large synthetic protein sequence data sets. The validation process revealed that the proposed method can reduce the processing time of huge sequence networks by a factor of 3-6 compared to previous Markov clustering solutions, without any loss of partition quality. CONCLUSIONS: A one-million-node and one-billion-edge protein sequence network defined by a BLAST similarity matrix can be processed with an upper-class personal computer in 100 minutes. Further improvement in speed is possible via parallel data processing, while the extension toward several million nodes needs intermediary data storage, for example on solid state drives.
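For orientation, the basic Markov clustering iteration that this line of work starts from alternates expansion (squaring a column-stochastic matrix) with inflation (element-wise powering followed by column renormalization). The sketch below is a plain dense version of that textbook loop, not the sparse two-stage algorithm of the paper; the function names, the default inflation value of 2.0, the added self-loops, and the argmax-based cluster readout are all illustrative assumptions.

```python
def normalize_columns(m):
    """Scale every column of a square matrix to sum to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_clustering(adjacency, inflation=2.0, max_iter=50, tol=1e-9):
    """Dense textbook MCL on a symmetric adjacency matrix (list of lists)."""
    n = len(adjacency)
    # Self-loops are a common convention to stabilize the iteration.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one matrix squaring (two-step random walk).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        inflated = normalize_columns([[v ** inflation for v in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Simplified readout: assign each node (column) to the row holding most
    # of its flow; ties resolve to the lowest row index.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())


# Two isolated triangles: nodes 0-2 and nodes 3-5.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_clustering(graph)
```

Each dense expansion step costs O(n³) time and O(n²) memory, which makes clear why sparse data structures and per-subgraph processing matter at the million-node scale discussed in the abstract. The toy graph also mirrors the abstract's second stage: once the network has broken into isolated subgraphs, each block evolves independently of the others.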
Affiliation(s)
- László Szilágyi
  - Faculty of Technical and Human Sciences, Sapientia University of Transylvania, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania; Department of Informatics, Petru Maior University, Str. N. Iorga Nr. 1, 540088 Tîrgu Mureş, Romania
- Sándor M Szilágyi
  - Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary; Department of Informatics, Petru Maior University, Str. N. Iorga Nr. 1, 540088 Tîrgu Mureş, Romania.
|