1
|
Data Resource Profile: The Cancer Public Library Database in South Korea. Cancer Res Treat 2024:crt.2024.207. [PMID: 38697846 DOI: 10.4143/crt.2024.207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 04/30/2024] [Indexed: 05/05/2024] Open
Abstract
This paper provides a comprehensive overview of the Cancer Public Library Database (CPLD), established under the Korean Clinical Data Utilization for Research Excellence project (K-CURE). The CPLD links data from four major population-based public sources: the Korea National Cancer Incidence Database in the Korea Central Cancer Registry, cause-of-death data in Statistics Korea, the National Health Information Database in the National Health Insurance Service, and the National Health Insurance Research Database in the Health Insurance Review & Assessment Service. These databases are linked using an encrypted resident registration number. The CPLD, established in 2022 and updated annually, comprises 1,983,499 men and women newly diagnosed with cancer between 2012 and 2019. It contains data on cancer registration and death, demographics, medical claims, general health checkups, and national cancer screening. The most common cancers among men in the CPLD were stomach (16.1%), lung (14.0%), colorectal (13.3%), prostate (9.6%), and liver (9.3%) cancers. The most common cancers among women were thyroid (20.4%), breast (16.6%), colorectal (9.0%), stomach (7.8%), and lung (6.2%) cancers. Among them, 571,285 died between 2012 and 2020 owing to cancer (89.2%) or other causes (10.8%). Upon approval, the CPLD is accessible to researchers through the K-CURE portal. The CPLD is a unique resource for diverse cancer research to investigate medical use before a cancer diagnosis, during initial diagnosis and treatment, and long-term follow-up. This offers expanded insight into healthcare delivery across the cancer continuum, from screening to end-of-life care.
|
2
|
Open-source intelligence: a comprehensive review of the current state, applications and future perspectives in cyber security. Artif Intell Rev 2023; 56:1-32. [PMID: 37362900 PMCID: PMC10014398 DOI: 10.1007/s10462-023-10454-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2023]
Abstract
The volume of data generated by today's digitally connected world is enormous, and a significant portion of it is publicly available. These data sources include web archives, public databases, and social networks and messaging channels such as Facebook, Twitter, LinkedIn, email, and Telegram. Open-source intelligence (OSINT) extracts information from a collection of publicly available and accessible data. OSINT can provide a solution to the challenges in extracting and gathering intelligence from various publicly available information and social networks. OSINT is currently expanding at an incredible rate, bringing new artificial intelligence-based approaches to address issues of national security, political campaigns, the cyber industry, criminal profiling, and society, as well as cyber threats and crimes. In this paper, we describe the current state of OSINT tools/techniques and the state of the art for various applications of OSINT in cyber security. In addition, we discuss the challenges and future directions to develop autonomous models. These models can provide solutions for different social network-based security, digital forensics, and cyber crime-based problems using various machine learning (ML), deep learning (DL) and artificial intelligence (AI) methods with OSINT.
|
3
|
The integration of large-scale public data and network analysis uncovers molecular characteristics of psoriasis. Hum Genomics 2022; 16:62. [PMID: 36437479 PMCID: PMC9703794 DOI: 10.1186/s40246-022-00431-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/05/2022] [Accepted: 11/07/2022] [Indexed: 11/29/2022] Open
Abstract
In recent years, a growing interest in the characterization of the molecular basis of psoriasis has been observed. However, despite the availability of a large amount of molecular data, many pathogenic mechanisms of psoriasis are still poorly understood. In this study, we performed an integrated analysis of 23 public transcriptomic datasets encompassing both lesional and uninvolved skin samples from psoriasis patients. We defined comprehensive gene co-expression network models of psoriatic lesions and uninvolved skin. Moreover, we curated and exploited a wide range of functional information from multiple public sources in order to systematically annotate the inferred networks. The integrated analysis of transcriptomics data and co-expression networks highlighted genes that are frequently dysregulated and show aberrant patterns of connectivity in the psoriatic lesion compared with the unaffected skin. Our approach also allowed us to identify plausible, previously unknown actors in the expression of the psoriasis phenotype. Finally, we characterized communities of co-expressed genes associated with relevant molecular functions and expression signatures of specific immune cell types associated with the psoriasis lesion. Overall, integrating experimentally driven results with curated functional information from public repositories represents an efficient approach to empower knowledge generation about psoriasis and may be applicable to other complex diseases.
|
4
|
A full-view management method based on artificial neural networks for energy and material-savings in wastewater treatment plants. ENVIRONMENTAL RESEARCH 2022; 211:113054. [PMID: 35276189 DOI: 10.1016/j.envres.2022.113054] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/15/2021] [Revised: 02/17/2022] [Accepted: 02/27/2022] [Indexed: 06/14/2023]
Abstract
Carbon neutrality has received extensive attention in the field of wastewater treatment. The optimal management of wastewater treatment plants (WWTPs) is both significant and urgent given the serious waste of energy and materials. In this study, a full-view management method based on artificial neural networks (ANNs) for energy and material savings in WWTPs was established. More than 5 years of historical operating data from two typical plants (size 40,000 t/d and 10,000 t/d) located in Chongqing, China, were obtained, and public data in the service area of each plant were systematically collected from open channels. These abundant historical and public data were used to train two ANNs (GRA-CNN-LSTM model and PCA-BPNN model) to predict the inlet/outlet wastewater quality and quantity. The overall average prediction accuracy of the inlet and outlet wastewater indicators is greater than 92.60% and 93.76%, respectively. By combining the two models, more appropriate process operation strategies can be obtained 2 weeks in advance, with more than 11.20% and 16.91% reduction of energy and material costs, respectively. This proposed method can provide full-view decision support for the optimal management of WWTPs and is also expected to support carbon emission control and carbon neutrality in the field of wastewater treatment.
|
5
|
Twenty Years of Addiction and Mental Illness in Alaska: Using the National Survey on Drug Use and Health to Understand Addiction in a Low Population and Rural State. J Community Health 2022; 47:680-686. [PMID: 35567711 DOI: 10.1007/s10900-022-01098-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/27/2022] [Indexed: 10/18/2022]
Abstract
Understanding changes in substance use in a small population state is challenging. Many national datasets restrict data to reduce the probability of identifying persons. Alaska is a small population state (731,000 residents) with a large geographic region (25% the size of the lower 48), a diverse population, and highly variable seasons, with fewer than 10% of the state being road accessible. Given the uniqueness of Alaska, this project sought to understand what could be learned about addiction and its relationships with unemployment and median income in Alaska. National Survey on Drug Use and Health, State and Small Area Estimates (1999-2020) data were analyzed to measure prevalence changes. Outcome prevalences were independently correlated with median income and annual unemployment rate as the annual collection periods varied. Analyses were limited to simple bivariate analyses due to the data restrictions. Median income was found to have stronger correlational relationships and significant relationships with more negative outcomes compared to unemployment. While annual unemployment rates had statistically significant relationships with substance use outcomes, negative mental health outcomes appeared more related to unemployment than median income. Alcohol use in the past month, cigarette and tobacco use, and pain reliever misuse declined while binge drinking in the past month and illicit drug use increased. More people reported depression, serious mental illness, and suicidal ideation and planning over time, peaking in the last year of data collection. While NSDUH data provide some idea of the changes in drug use over time, their effectiveness in Alaska is unknown. Many data sources claim they are nationally representative, but these statements cannot be objectively measured. We will use these outcomes and data as a baseline for future studies where we will explore state specific data sources.
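As an illustration of the simple bivariate analysis this abstract mentions, the following is a minimal sketch of a Pearson correlation between two annual series. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption, and the function name `pearson_r` and the sample values are hypothetical, not taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Hypothetical values for illustration only (not data from the study):
# an annual outcome prevalence (%) paired with an annual unemployment rate (%).
prevalence = [10.1, 10.4, 11.0, 11.3, 12.2]
unemployment = [6.5, 6.9, 7.4, 7.2, 8.0]
r = pearson_r(prevalence, unemployment)
```

Each outcome would be paired with one explanatory series at a time, which matches the "independently correlated" wording but leaves out any significance testing.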
|
6
|
The conspiracy of Covid-19 and 5G: Spatial analysis fallacies in the age of data democratization. Soc Sci Med 2021; 293:114546. [PMID: 34954674 PMCID: PMC8576388 DOI: 10.1016/j.socscimed.2021.114546] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/13/2021] [Revised: 10/08/2021] [Accepted: 11/04/2021] [Indexed: 02/05/2023]
Abstract
In a context of mistrust in public health institutions and practices, anti-COVID/vaccination protests and the storming of Congress have illustrated that conspiracy theories are a real and imminent threat to health and wellbeing, democracy, and public understanding of science. One manifestation of this is the suggested correlation of COVID-19 with 5G mobile technology. Throughout 2020, this alleged correlation was promoted and distributed widely on social media, often in the form of maps overlaying the distribution of COVID-19 cases with the installation of 5G towers. These conspiracy theories are not fringe phenomena, and they form part of a growing repertoire for conspiracist activist groups with capacities for organised violence. In this paper, we outline how spatial data have been co-opted, and spatial correlations asserted, by conspiracy theorists. We consider the basis of their claims of causal association with reference to three key areas of geographical explanation: (1) how social properties are constituted and how they exert complex causal forces, (2) the pitfalls of correlation with spatial and ecological data, and (3) the challenges of specifying and interpreting causal effects with spatial data. For each, we consider the unique theoretical and technical challenges involved in specifying meaningful correlation, and how discarding them facilitates conspiracist attribution. In doing so, we offer a basis to interrogate conspiracists’ uses and interpretation of data from elementary principles, and we offer some cautionary notes on the potential for their future misuse in an age of data democratization. Finally, this paper contributes to work on the basis of conspiracy theories in general, by asserting how, absent an appreciation of these key methodological principles, spatial health data may be especially prone to co-option by conspiracist groups.
|
7
|
Modeling the impact of exposure reductions using multi-stressor epidemiology, exposure models, and synthetic microdata: an application to birthweight in two environmental justice communities. JOURNAL OF EXPOSURE SCIENCE & ENVIRONMENTAL EPIDEMIOLOGY 2021; 31:442-453. [PMID: 33824415 PMCID: PMC8141037 DOI: 10.1038/s41370-021-00318-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2020] [Revised: 02/25/2021] [Accepted: 03/12/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND: Many vulnerable populations experience elevated exposures to environmental and social stressors, with deleterious effects on health. Multi-stressor epidemiological models can be used to assess benefits of exposure reductions. However, requisite individual-level risk factor data are often unavailable at adequate spatial resolution. OBJECTIVE: To leverage public data and novel simulation methods to estimate birthweight changes following simulated environmental interventions in two environmental justice communities in Massachusetts, USA. METHODS: We gathered risk factor data from public sources (US Census, Behavioral Risk Factor Surveillance System, and Massachusetts Department of Health). We then created synthetic individual-level data sets using combinatorial optimization, and probabilistic and logistic modeling. Finally, we used coefficients from a multi-stressor epidemiological model to estimate birthweight and birthweight improvement associated with simulated environmental interventions. RESULTS: We created geographically resolved synthetic microdata. Mothers with the lowest predicted birthweight were those identifying as Black or Hispanic, with parity > 1, utilization of government prenatal support, and lower educational attainment. Birthweight improvements following greenness and temperature improvements were similar for all high-risk groups and were larger than benefits from smoking cessation. SIGNIFICANCE: Absent private health data, this methodology allows for assessment of cumulative risk and health inequities, and comparison of individual-level impacts of localized health interventions.
|
8
|
Accessible molecular phylogenomics at no cost: obtaining 14 new mitogenomes for the ant subfamily Pseudomyrmecinae from public data. PeerJ 2019; 7:e6271. [PMID: 30697483 PMCID: PMC6348091 DOI: 10.7717/peerj.6271] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/07/2018] [Accepted: 12/10/2018] [Indexed: 11/20/2022] Open
Abstract
The advent of Next Generation Sequencing has reduced sequencing costs and increased the number of genomic projects across a huge range of organismal taxa, generating an unprecedented amount of publicly available genomic datasets. Often, researchers use only a tiny fraction of the genomic data they produce, the part most relevant to their own work. This allows the data generated to be recycled in further projects worldwide. The assembly of complete mitogenomes is frequently overlooked, though it is useful for understanding evolutionary relationships among taxa, especially those with poor mtDNA sampling at the level of genera and families. This is exactly the case for ants (Hymenoptera: Formicidae) and more specifically for the subfamily Pseudomyrmecinae, a group of arboreal ants with several cases of convergent coevolution without any complete mitochondrial sequence available. In this work, we assembled, annotated and performed comparative genomics analyses of 14 new complete mitochondria from Pseudomyrmecinae species relying solely on public datasets available from the Sequence Read Archive (SRA). We used all complete mitogenomes available for ants to study the gene order conservation and also to generate two phylogenetic trees using both (i) a concatenated set of 13 mitochondrial genes and (ii) the whole mitochondrial sequences. Even though the tree topologies diverged subtly from each other (and from previous studies), our results confirm several known relationships and generate new evidence for sister clade classification inside the Pseudomyrmecinae clade. We also performed a synteny analysis for Formicidae and identified possible sites in which nucleotide insertions happened in mitogenomes of pseudomyrmecine ants. Using a data mining/bioinformatics approach, the current work increased the number of complete mitochondrial genomes available for ants from 15 to 29, demonstrating the unique potential of public databases for mitogenomics studies. The wide applications of mitogenomes in research and the presence of mitochondrial data in different public dataset types make the "no budget mitogenomics" approach ideal for comprehensive molecular studies, especially for subsampled taxa.
|
9
|
Exploring the nature of prediagnostic blood transcriptome markers of chronic lymphocytic leukemia by assessing their overlap with the transcriptome at the clinical stage. BMC Genomics 2017; 18:239. [PMID: 28320322 PMCID: PMC5360061 DOI: 10.1186/s12864-017-3627-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/01/2016] [Accepted: 03/14/2017] [Indexed: 11/10/2022] Open
Abstract
Background: We recently identified 700 genes whose expression levels were predictive of chronic lymphocytic leukemia (CLL) in a genome-wide gene expression analysis of prediagnostic blood from future cases and matched controls. We hypothesized that a large fraction of these markers were likely related to early disease manifestations. Here we aim to gain a better understanding of the natural history of the identified markers by comparing results from our prediagnostic analysis, the only prediagnostic analysis to date, to results obtained from a meta-analysis of a series of publicly available transcriptomics profiles obtained in incident CLL cases and controls. Results: We observed considerable overlap between the results from our prediagnostic study and the clinical CLL signals (p-value for overlap of Bonferroni-significant markers: 0.01; p-value for overlap of nominally significant markers: < 2.20e-16). We observed similar patterns with time to diagnosis and similar functional annotations for the markers that were identified in both settings compared to the markers that were only identified in the prediagnostic study. These results suggest that both gene sets operate in similar pathways. Conclusion: An overlap exists between expression levels of genes predictive of CLL identified in prediagnostic blood and expression levels of genes associated with CLL at the clinical stage. Our analysis provides insight into a set of genes for which expression levels can be used to follow the time-course of the disease, providing an opportunity to study CLL progression in more detail in future studies.
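To make the idea of a "p-value for overlap" concrete, here is a minimal sketch of one common way to test whether two marker lists overlap more than chance would predict: an upper-tail hypergeometric test. The abstract does not state which test the authors used, so this choice is an assumption, and the function name `overlap_p_value` and all counts below are hypothetical.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: total number of genes tested in both studies
    size_a:   number of markers in the first list
    size_b:   number of markers in the second list
    overlap:  number of markers shared by both lists
    """
    max_overlap = min(size_a, size_b)
    if not 0 <= overlap <= max_overlap or max(size_a, size_b) > universe:
        raise ValueError("inconsistent set sizes")
    # Number of ways to draw size_b genes from the universe.
    total = comb(universe, size_b)
    # Number of draws sharing at least `overlap` genes with the first list.
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts for illustration only (not figures from the study).
p = overlap_p_value(universe=10000, size_a=700, size_b=500, overlap=60)
```

Under random sampling the expected overlap in this hypothetical case is 700 × 500 / 10000 = 35 genes, so an observed overlap of 60 falls well into the upper tail.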
|
10
|
Abstract
Comparing gene expression profiles measured in a wide range of different tissue types, at different developmental stages, or under different environmental conditions can yield valuable insights into the mechanisms of cell/tissue specification and differentiation, or identify cell/tissue-type specific responses to environmental stimuli. Critical for such comparisons is the identical processing of data from different sources. This may also include the integration of a novel data set into an existing collection of data sets (e.g., in-house and publicly available data). Here, I describe a complete workflow for RNA-Seq data, from data processing steps to the comparison of gene expression profiles measured with RNA-Seq. I use publicly available data for demonstration purposes, but I also describe how to integrate your own data sets. The workflow runs on all three major operating systems (Linux, MacOS, and Windows). The scripts and the tutorial can be accessed on github.com/MWSchmid/RNAseq_protocol.
|
11
|
Improving the value of healthcare delivery using publicly available performance data in Wisconsin and California. HEALTHCARE-THE JOURNAL OF DELIVERY SCIENCE AND INNOVATION 2015; 2:85-9. [PMID: 26250373 DOI: 10.1016/j.hjdsi.2014.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2013] [Revised: 01/02/2014] [Accepted: 01/18/2014] [Indexed: 10/25/2022]
Abstract
The healthcare industry must change in order to provide higher quality care and lower costs for patients; one method to improve both cost and quality used in Wisconsin and California is leveraging publicly reported claims and costs data. Wisconsin has been building comprehensive, publicly available clinical and administrative data sets: the Wisconsin Collaborative for Healthcare Quality (WCHQ) established in 2003 and the Wisconsin Health Information Organization (WHIO) established in 2009. The WCHQ and the WHIO allow physician groups to compare themselves with one another on cost and quality across 920 distinct episode treatment groups (ETGs). The ETGs include all components of care for a specific disease during a defined period. Since 2002 California has developed public reporting of quality data for physician groups and health plans through its Integrated Healthcare Association (IHA) and since 2008 its Right Care Initiative (RCI). In both states these data are used to identify best practices and opportunities for improvement, enhance care outcomes, and increase value for patients.
|
12
|
Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data. J Cheminform 2015; 7:9. [PMID: 25798198 PMCID: PMC4369291 DOI: 10.1186/s13321-015-0057-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 11/24/2014] [Accepted: 02/23/2015] [Indexed: 11/12/2022] Open
Abstract
The current rise in the use of open lab notebook techniques means that there are an increasing number of scientists who make chemical information freely and openly available to the entire community as a series of micropublications that are released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine readable formats. Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.
|
13
|
Creating a data exchange strategy for radiotherapy research: towards federated databases and anonymised public datasets. Radiother Oncol 2014; 113:303-9. [PMID: 25458128 PMCID: PMC4648243 DOI: 10.1016/j.radonc.2014.10.001] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Received: 09/11/2014] [Revised: 10/01/2014] [Accepted: 10/02/2014] [Indexed: 12/25/2022]
Abstract
Disconnected cancer research data management and lack of information exchange about planned and ongoing research are complicating the utilisation of internationally collected medical information for improving cancer patient care. Rapidly collecting/pooling data can accelerate translational research in radiation therapy and oncology. The exchange of study data is one of the fundamental principles behind data aggregation and data mining. The possibilities of reproducing the original study results, performing further analyses on existing research data to generate new hypotheses or developing computational models to support medical decisions (e.g. risk/benefit analysis of treatment options) represent just a fraction of the potential benefits of medical data-pooling. Distributed machine learning and knowledge exchange from federated databases can be considered one among other attractive approaches for knowledge generation within “Big Data”. Data interoperability between research institutions should be the major concern behind a wider collaboration. Information captured in electronic patient records (EPRs) and study case report forms (eCRFs), linked together with medical imaging and treatment planning data, is deemed to be a fundamental element for large multi-centre studies in the field of radiation therapy and oncology. To fully utilise the captured medical information, the study data have to be more than just an electronic version of a traditional (un-modifiable) paper CRF. Challenges that have to be addressed are data interoperability, utilisation of standards, data quality and privacy concerns, data ownership, rights to publish, data pooling architecture and storage. This paper discusses a framework for conceptual packages of ideas focused on a strategic development for international research data exchange in the field of radiation therapy and oncology.
|