1
Applying Affordance Theory to Big Data Analytics Adoption. ENTERP INF SYST-UK 2022. DOI: 10.1007/978-3-031-08965-7_17
2
Fast readout method for multidimensional optical data storage using interferometry-aided reflectance spectroscopy. OPTICS EXPRESS 2021; 29:36608-36615. PMID: 34809068. DOI: 10.1364/oe.440657
Abstract
Multiplexing techniques increase the capacity of optical data storage, but current reading throughput is limited by single-bit reading. We propose a fast readout method for multidimensional optical data storage that uses interference-aided reflectance spectral measurement to read out multiple bits of information simultaneously. The multidimensional data are recorded by laser direct writing in a photoresist layer on a disc with a dielectric multilayer substrate. With the designed interference layer inside the disc, the relationship between the thickness of the recording layer and the peak shift of the reflected spectrum has been established. With different writing depths representing different bits of data, 2-bit and 3-bit units of information have been recorded and successfully read out in a single exposure. This fast readout method is suitable not only for optical data storage that engineers the optical path length, such as Blu-ray discs, but also for super-resolution optical data storage.
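The depth-to-bits readout lends itself to a small numeric illustration. The sketch below is a minimal, hypothetical decoder assuming a linear calibration between writing depth and reflectance-peak shift; the wavelength step, level count, and function name are illustrative assumptions, not values from the paper.

```python
def peak_shift_to_bits(peak_nm, ref_peak_nm, nm_per_level, n_bits):
    """Map a measured reflectance-peak shift to an n-bit symbol.

    Assumes (hypothetically) a linear relation between writing depth
    and spectral peak shift, so each of the 2**n_bits depth levels
    shifts the peak by a fixed wavelength step.
    """
    shift = peak_nm - ref_peak_nm
    level = int(round(shift / nm_per_level))       # nearest depth level
    level = max(0, min(2**n_bits - 1, level))      # clamp to valid range
    return format(level, f"0{n_bits}b")            # e.g. '101' for 3 bits

# Example: 595 nm reference peak, 5 nm per level, 3-bit encoding.
print(peak_shift_to_bits(peak_nm=620.0, ref_peak_nm=595.0,
                         nm_per_level=5.0, n_bits=3))  # -> '101'
```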
3
Abstract
While DNA's perpetual role in biology and life science is well documented, its burgeoning digital applications are beginning to garner significant interest. As the development of novel technologies requires continuous research, product development, startup creation, and financing, this work provides an overview of each respective area and highlights current trends, challenges, and opportunities. These are supported by numerous interviews with key opinion leaders from across academia, government agencies, and the commercial sector, as well as investment data analysis. Our findings illustrate the societal and economic need for technological innovation and disruption in data storage, paving the way for nature's own time-tested, advantageous, and unrivaled solution. We anticipate a significant increase in available investment capital and continuous scientific progress, creating a ripe environment on which DNA data storage-enabling startups can capitalize to bring DNA data storage into daily life.
Highlights:
- Overview of current DNA data storage technologies and commercialization hurdles
- Insights from leading DNA data storage experts and investment financing data
- DNA synthesis remains the biggest challenge in the industry
- Archiving cold data is the low-hanging fruit in DNA data storage
- An upward trend in the investment landscape suggests an optimal startup fundraising period
4
Using Topomer Comparative Molecular Field Analysis to Elucidate Activity Differences of Aminomethylenethiophene Derivatives as Lysyl Oxidase Inhibitors: Implications for Rational Design of Antimetastatic Agents for Cancer Therapy. J CHEM-NY 2020. DOI: 10.1155/2020/2036585
Abstract
Topomer comparative molecular field analysis (topomer CoMFA) is applied to the quantitative structure-activity relationship (QSAR) study of aminomethylenethiophene (AMT) derivatives as lysyl oxidase (LOX) inhibitors. A total of thirty-six AMT derivatives were selected to build the QSAR model. The established topomer CoMFA model has a non-cross-validated correlation coefficient (r²) of 0.912 and a leave-one-out cross-validated correlation coefficient (q²) of 0.540, which is statistically significant. The theoretically predicted anti-LOX potency agrees well with the experimentally observed inhibitory activity, demonstrating the reasonable predictive ability of the QSAR model. The effect of molecular field information on the LOX inhibition of substituted aminomethylenethiophenes is discussed in detail. Structural modification of the aminomethylenethiophene scaffold was carried out, and novel AMT derivatives with theoretically decent LOX inhibition are proposed. Topomer CoMFA modeling can provide a quantitative perspective on the structure-activity relationship of AMT derivatives and potentially speed up the rational design of LOX inhibitors as antimetastatic agents for cancer therapy.
5
Abstract
We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types of relevance for the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress to address ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.
6
Modulating trap properties by Nd3+-Eu3+ co-doping in Sr2SnO4 host for optical information storage. OPTICS EXPRESS 2020; 28:4249-4257. PMID: 32122081. DOI: 10.1364/oe.386164
Abstract
We report a novel Nd3+ and Eu3+ co-doped Sr2SnO4 (SSONE) phosphor showing "write-in" and "read-out" capability for optical information storage. The as-prepared phosphors exhibit a dominant photoluminescence (PL) band centered at 596 nm under UV excitation, nearly identical to the centers of the photo-stimulated luminescence (PSL) spectrum (595 nm, under near-infrared (NIR) light) and the thermally stimulated luminescence (TSL) spectrum (595 nm, under heating). Remarkably, compared with Eu3+ singly doped phosphors, the co-doping strategy enhances the deep traps and also separates the deep traps from the shallow traps, both crucial factors for optical information storage in electron-trapping materials. Finally, a demonstration confirmed the optical information storage capability by photo- and thermally stimulating the prepared phosphors filled into designed patterns.
7
Tailoring Multidimensional Traps for Rewritable Multilevel Optical Data Storage. ACS APPLIED MATERIALS & INTERFACES 2019; 11:35023-35029. PMID: 31474109. DOI: 10.1021/acsami.9b13011
Abstract
In the current "big data" era, the state-of-the-art optical data storage (ODS) has become a front-runner in the competing data storage technologies. As one of the most promising methods for breaking the physical limitation suffered by traditional ones, the advance of optically stimulated luminescence (OSL) based optical storage technique is now still limited by the simultaneous single-level write-in and readout in a same spot. In this work, to bridge the data-capacity gap, we report for the first time a novel and promising nonphysical multidimensional OSL-based ODS flexible medium for erasable multilevel optical data recording and reading. We tailor multidimensional traps with discrete, narrowly distributed energy levels through (multi-)codoping of selective trivalent rare-earth ions into Eu2+-activated barium orthosilicate (Ba2SiO4). Upon UV/blue light illumination, information can be sequentially recorded in different traps assisted by thermal cleaning with an increase of storage capacity by orders of magnitude, which is addressable individually in the whole domain or bit-by-bit mode without the crosstalk by designed thermal/optical stimuli. Remarkably, good data retention and robust fatigue resistance have been achieved in recycle data recording. Insight is forged from charge carrier dynamics and interactions with traps for a universal method of data storage, and proof-of-concept applications are also demonstrated, thereby providing the way to not only rewritable multilevel ODS but also high-security encryption/decryption.
8
The Impact of Big Data Analytics on Company Performance in Supply Chain Management. SUSTAINABILITY 2019. DOI: 10.3390/su11184864
Abstract
Big data analytics can add value and provide a new perspective by improving predictive analysis and modeling practices. This research centers on supply-chain management and how big data analytics can help Romanian supply-chain companies assess their experience, strategies, and professional capabilities in successfully implementing big data analytics, as well as the tools needed to achieve these goals, including the results of implementation and the performance achieved. The quantitative study used a sampling survey, with a questionnaire as the data collection tool; it included closed questions measured on nominal and ordinal scales. A total of 205 managers provided complete and usable answers for this research. The collected data were analyzed with the Statistical Package for the Social Sciences (SPSS) using frequency tables, contingency tables, and principal component analysis. The major contribution of this research is to highlight that companies are concerned with identifying new statistical methods, tools, and approaches, such as cloud computing and security technologies, that need to be rigorously explored.
9
Look before you leap: Barriers to big data use in municipalities. INFORMATION POLITY 2019. DOI: 10.3233/ip-180090
10
Business intelligence and analytics for value creation: The role of absorptive capacity. INTERNATIONAL JOURNAL OF INFORMATION MANAGEMENT 2019. DOI: 10.1016/j.ijinfomgt.2018.11.020
11
Abstract
Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Due to the large volumes involved, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer.
In this chapter we show how to describe and execute the same analysis using a number of workflow systems, and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), the Guix Workflow Language (GWL), Snakemake, and Nextflow, each of which can be run in parallel.
We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions represent the overall majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Bundled in lightweight containers, this software can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.
By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
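The container-based portability idea lends itself to a small illustration. The following is a minimal sketch, assuming Docker is installed and that a hypothetical container image provides the MAFFT aligner; it is not code from the chapter itself.

```python
import subprocess
from pathlib import Path

def run_in_container(image, command, workdir):
    """Run one pipeline step inside a Docker container.

    Mounting `workdir` at /data means the step sees the same files
    regardless of which machine (desktop, cluster, cloud) executes it,
    which is the portability property discussed above.
    """
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{Path(workdir).resolve()}:/data", "-w", "/data",
         image] + command,
        check=True,
    )

# Hypothetical usage with a placeholder image name:
# run_in_container("example/mafft:latest", ["mafft", "input.fasta"], ".")
```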
12
Tailoring Trap Depth and Emission Wavelength in Y3Al5-xGaxO12:Ce3+,V3+ Phosphor-in-Glass Films for Optical Information Storage. ACS APPLIED MATERIALS & INTERFACES 2018; 10:27150-27159. PMID: 30044082. DOI: 10.1021/acsami.8b10713
Abstract
Deep-trap persistent luminescent materials, due to their exceptional ability of energy storage and controllable photon release under external stimulation, have attracted considerable attention in the field of optical information storage. Currently, the lack of suitable materials is still the bottleneck that restrains their practical applications. Herein, we successfully synthesized a series of deep-trap persistent luminescent materials Y3Al5-xGaxO12:Ce3+,V3+ (x = 0-3) with a garnet structure and developed novel phosphor-in-glass (PiG) films containing these phosphors. The synthesized PiG films exhibited sufficiently deep traps, narrow trap-depth distributions, high trap density, high quantum efficiency, and excellent chemical stability, solving the high-temperature chemical-stability problem of previously reported phosphor-in-silicone films. Moreover, the trap depth in the phosphors and PiG films could be tailored from 1.2 to 1.6 eV thanks to the bandgap engineering effect, and the emission color was simultaneously changed from green to yellow due to the variation of crystal field strength. Image information was recorded on the PiG films using a 450 nm blue laser in laser direct writing mode, and the recorded information was retrieved under high-temperature thermal stimulation or photostimulation. The Y3Al5-xGaxO12:Ce3+,V3+ PiG films presented in this work are very promising for applications in multidimensional and rewritable optical information storage.
13
Abstract
Introduction: The development of improved cancer therapies is frequently cited as an urgent unmet medical need. Recent advances in platform technologies and the increasing availability of biological 'big data' are providing an unparalleled opportunity to systematically identify the key genes and pathways involved in tumorigenesis. The discoveries made using these new technologies may lead to novel therapeutic interventions.
Areas covered: The authors discuss the current approaches that use 'big data' to identify cancer drivers. These approaches include the analysis of genomic sequencing data, pathway data, multi-platform data, identifying genetic interactions such as synthetic lethality, and using cell line data. They review how big data is being used to identify novel drug targets. The authors then provide an overview of the available data repositories and tools being used at the forefront of cancer drug discovery.
Expert opinion: Targeted therapies based on the genomic events driving the tumour will eventually inform treatment protocols. However, using a tailored approach to treat all tumour patients may require developing a large repertoire of targeted drugs.
14
Abstract
This review highlights the fundamental role of nutrition in the maintenance of health, the immune response, and disease prevention. Emerging global mechanistic insights in the field of nutritional immunology cannot be gained through reductionist methods alone or by analyzing a single nutrient at a time. We propose to investigate nutritional immunology as a massively interacting system of interconnected multistage and multiscale networks that encompass hidden mechanisms by which nutrition, microbiome, metabolism, genetic predisposition, and the immune system interact to delineate health and disease. The review sets an unconventional path to apply complex science methodologies to nutritional immunology research, discovery, and development through “use cases” centered around the impact of nutrition on the gut microbiome and immune responses. Our systems nutritional immunology analyses, which include modeling and informatics methodologies in combination with pre-clinical and clinical studies, have the potential to discover emerging systems-wide properties at the interface of the immune system, nutrition, microbiome, and metabolism.
15
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining. PROCEEDINGS OF THE IEEE 2016; 104:93-110. PMID: 27087700. PMCID: PMC4827453. DOI: 10.1109/jproc.2015.2494178
Abstract
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the latter regime applies to exascale data dimensions. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
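As a toy illustration of the sample-starved regime described above (fixed n, large p), the sketch below screens for large pairwise sample correlations in pure noise; the dimensions and threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2000              # sample-starved: n << p
X = rng.standard_normal((n, p))

# Sample correlation matrix of the p variables (p x p).
R = np.corrcoef(X, rowvar=False)

# Screen for pairs whose |correlation| exceeds a threshold. With n this
# small, many large correlations arise purely by chance, which is the
# false-discovery hazard that sample-complexity analysis quantifies.
thresh = 0.7
iu = np.triu_indices(p, k=1)
n_hits = int(np.sum(np.abs(R[iu]) > thresh))
print(f"{n_hits} of {iu[0].size} pairs exceed |r| > {thresh} under pure noise")
```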
16
BioDB extractor: customized data extraction system for commonly used bioinformatics databases. BioData Min 2015; 8:31. PMID: 26516349. PMCID: PMC4624652. DOI: 10.1186/s13040-015-0067-z
Abstract
Background: Diverse types of biological data, primary as well as derived, are available in various formats and are stored in heterogeneous resources. Database-specific as well as integrated search engines are available for carrying out efficient searches of databases. These search engines, however, do not support extraction of subsets of data with the same level of granularity that exists in typical database entries. In order to extract fine-grained subsets of data, users are required to download complete or partial database entries and write scripts for parsing and extraction.
Results: BioDBExtractor (BDE) has been developed to provide 26 customized data extraction utilities for some of the commonly used databases, such as ENA (EMBL-Bank), UniProtKB, PDB, and KEGG. BDE eliminates the need for downloading entries and writing scripts. BDE has a simple web interface that enables input of queries in the form of accession numbers/ID codes, choice of utilities, and selection of fields/subfields of data.
Conclusions: BDE thus provides a common data extraction platform for multiple databases and is useful to both novice and expert users. BDE, however, is not a substitute for basic keyword-based database searches. Desired subsets of data compiled using BDE can subsequently be used for downstream processing, analyses, and knowledge discovery.
Availability: BDE can be accessed at http://bioinfo.net.in/BioDB/Home.html.
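The kind of ad hoc script that BDE is designed to make unnecessary might look like the sketch below, which extracts accessions and the sequence from an already-downloaded UniProtKB flat-text (SwissProt-format) entry. The file name is hypothetical, and the line-code handling is based on the general flat-file conventions, not on BDE itself.

```python
def extract_accession_and_sequence(path):
    """Parse a UniProtKB flat-text entry (SwissProt format).

    AC lines carry accession numbers; sequence residues follow the SQ
    line until the terminating '//'. This mirrors the manual scripting
    step that BioDBExtractor is designed to replace.
    """
    accessions, seq_lines, in_seq = [], [], False
    with open(path) as fh:
        for line in fh:
            if line.startswith("AC"):
                accessions += [a.strip() for a in line[5:].split(";") if a.strip()]
            elif line.startswith("SQ"):
                in_seq = True                  # sequence block starts next line
            elif line.startswith("//"):
                in_seq = False
            elif in_seq:
                seq_lines.append(line.strip().replace(" ", ""))
    return accessions, "".join(seq_lines)

# Hypothetical usage with a previously downloaded entry:
# accs, seq = extract_accession_and_sequence("P12345.txt")
```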
17
18
Abstract
Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed those of other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a “four-headed beast”—it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the “genomical” challenges of the next decade. This perspective considers the growth of genomics over the next ten years and assesses the computational needs that we will face relative to other "Big Data" activities such as astronomy, YouTube, and Twitter.
19
Pheno2Geno - High-throughput generation of genetic markers and maps from molecular phenotypes for crosses between inbred strains. BMC Bioinformatics 2015; 16:51. PMID: 25886992. PMCID: PMC4339742. DOI: 10.1186/s12859-015-0475-6
Abstract
Background: Genetic markers and maps are instrumental in quantitative trait locus (QTL) mapping in segregating populations. The resolution of QTL localization depends on the number of informative recombinations in the population and how well they are tagged by markers. Larger populations and denser marker maps are better for detecting and locating QTLs. Marker maps that are initially too sparse can be saturated, or derived de novo, from high-throughput omics data (e.g., gene expression, protein or metabolite abundance). If these molecular phenotypes are affected by genetic variation due to a major QTL, they will show a clear multimodal distribution. Using this information, phenotypes can be converted into genetic markers.
Results: The Pheno2Geno tool uses mixture modeling to select phenotypes and transform them into genetic markers suitable for construction and/or saturation of a genetic map. Pheno2Geno excludes candidate genetic markers that show evidence for multiple, possibly epistatically interacting, QTL and/or interaction with the environment, in order to provide a set of robust markers for follow-up QTL mapping. We demonstrate the use of Pheno2Geno on gene expression data of 370,000 probes in 148 A. thaliana recombinant inbred lines. Pheno2Geno is able to saturate the existing genetic map, decreasing the average distance between markers from 7.1 cM to 0.89 cM, close to the theoretical limit of 0.68 cM (with 148 individuals we expect a recombination every 100/148 = 0.68 cM); this pinpointed almost all of the informative recombinations in the population.
Conclusion: The Pheno2Geno package makes use of genome-wide molecular profiling and provides a tool for high-throughput de novo map construction and saturation of existing genetic maps. Processing of the showcase dataset takes less than 30 minutes on an average desktop PC. Pheno2Geno improves QTL mapping results at no additional laboratory cost and with minimum computational effort. Its results are formatted for direct use in R/qtl, the leading R package for QTL studies. Pheno2Geno is freely available on CRAN under the GNU GPL v3. The package and a tutorial can also be found at: http://pheno2geno.nl.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0475-6) contains supplementary material, which is available to authorized users.
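As a hedged sketch of the core mixture-modeling step (Pheno2Geno itself is an R package on CRAN; the Python below is only an illustration on simulated data), a two-component Gaussian mixture is fitted to one bimodal molecular phenotype, and the component assignments are taken as candidate marker genotypes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Simulated expression for 148 RILs: a major QTL splits them into two modes.
genotype = rng.integers(0, 2, size=148)            # hidden true genotype
expr = np.where(genotype == 0,
                rng.normal(4.0, 0.5, 148),
                rng.normal(7.0, 0.5, 148))

gm = GaussianMixture(n_components=2, random_state=0).fit(expr.reshape(-1, 1))
marker = gm.predict(expr.reshape(-1, 1))           # candidate marker genotypes

# Agreement with the hidden genotype (up to label swapping):
agree = max(np.mean(marker == genotype), np.mean(marker != genotype))
print(f"marker matches underlying genotype for {agree:.0%} of individuals")
```

The map-resolution limit quoted in the abstract follows from simple arithmetic: with 148 individuals, one informative recombination is expected roughly every 100/148 ≈ 0.68 cM.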
20
Sambamba: fast processing of NGS alignment formats. Bioinformatics 2015; 31:2032-4. PMID: 25697820. PMCID: PMC4765878. DOI: 10.1093/bioinformatics/btv098
Abstract
Summary: Sambamba is a high-performance, robust tool and library for working with SAM, BAM and CRAM sequence alignment files, the most common file formats for aligned next-generation sequencing data. Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time. Sambamba is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability.
Availability and implementation: Sambamba is free and open source software, available under a GPLv2 license. Sambamba can be downloaded and installed from http://www.open-bio.org/wiki/Sambamba. Sambamba v0.5.0 was released with doi:10.5281/zenodo.13200.
Contact: j.c.p.prins@umcutrecht.nl
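A minimal sketch of how a pipeline might shell out to Sambamba for multi-core sorting and duplicate marking. The subcommand names and flags (sort, markdup, -t for threads, -o for sort output) are our assumption from Sambamba's usual command-line interface, not verified against a specific release; check sambamba --help before relying on them.

```python
import subprocess

def sort_and_markdup(in_bam, out_bam, threads=8):
    """Sort a BAM and mark duplicates with sambamba on multiple cores.

    Flag names are assumptions based on sambamba's typical CLI;
    verify locally before relying on them.
    """
    subprocess.run(
        ["sambamba", "sort", "-t", str(threads), "-o", "sorted.tmp.bam", in_bam],
        check=True,
    )
    subprocess.run(
        ["sambamba", "markdup", "-t", str(threads), "sorted.tmp.bam", out_bam],
        check=True,
    )

# sort_and_markdup("sample.bam", "sample.dedup.bam")
```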
21
22
Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections. BMC Med Res Methodol 2014; 14:99. PMID: 25154404. PMCID: PMC4146451. DOI: 10.1186/1471-2288-14-99
Abstract
Background: Big data is steadily growing in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome.
Methods: We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and the Least Absolute Shrinkage and Selection Operator (LASSO), with a penalty in multivariate logistic regression to achieve a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similarly sized datasets to estimate the True (TPR) and False (FPR) Positive Rates associated with these methods.
Results: Between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection, depending on the method. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 (87%) covariates were not selected by any method. At the 5% nominal significance level, the TPR were 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. Conversely, the FPR were 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO.
Conclusions: Data mining methods and LASSO should be considered valuable methods for detecting independent associations in large epidemiologic datasets.
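The permutation-test idea from the Methods can be sketched as follows: refit the classifier on outcome-permuted data to obtain a null distribution for each covariate's importance. The use of scikit-learn's RandomForestClassifier, the simulated data, and the permutation count are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.standard_normal((498, 50))                  # 498 subjects, 50 covariates
y = (X[:, 0] + rng.standard_normal(498) > 1.5).astype(int)  # covariate 0 matters

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
observed = rf.feature_importances_

# Null distribution: importances after permuting the outcome labels.
n_perm = 50
null = np.empty((n_perm, X.shape[1]))
for b in range(n_perm):
    y_perm = rng.permutation(y)
    rf_b = RandomForestClassifier(n_estimators=200, random_state=b)
    null[b] = rf_b.fit(X, y_perm).feature_importances_

# One-sided permutation p-value per covariate.
pvals = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)
print("covariates with p < 0.05:", np.where(pvals < 0.05)[0])
```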
23
Harnessing the power of big data: infusing the scientific method with machine learning to transform ecology. Ecosphere 2014. DOI: 10.1890/es13-00359.1
24
Abstract
Hundreds of millions of figures are available in the biomedical literature, representing important biomedical experimental evidence. This sheer, ever-increasing volume has made it difficult for scientists to effectively and accurately access figures of interest, a process crucial for validating research facts and for formulating or testing novel research hypotheses. Current figure search applications cannot fully meet this challenge, as the “bag of figures” assumption does not take into account the relationships among figures. In our previous study, hundreds of biomedical researchers annotated articles in which they serve as corresponding authors, ranking each figure in their paper by its importance at their discretion; we refer to this as “figure ranking”. Using this collection of annotated data, we investigated computational approaches to automatically rank figures. We exploited and extended state-of-the-art listwise learning-to-rank algorithms and developed a new supervised-learning model, BioFigRank. The cross-validation results show that BioFigRank yielded the best performance compared with other state-of-the-art computational models, and that greedy feature selection can further boost ranking performance significantly. Furthermore, we evaluated BioFigRank against three levels of competitive domain-specific human experts: (1) First Author; (2) Non-Author In-Domain Expert, who is neither the author nor a co-author of an article but works in the same field as the corresponding author; and (3) Non-Author Out-Domain Expert, who is neither the author nor a co-author of an article and may or may not work in the same field as the corresponding author. Our results show that BioFigRank outperforms Non-Author Out-Domain Experts and performs as well as Non-Author In-Domain Experts. Although BioFigRank underperforms First Authors, since most biomedical researchers are either in- or out-domain experts for a given article, we conclude that BioFigRank represents an artificial intelligence system offering expert-level intelligence to help biomedical researchers navigate increasingly proliferating big data efficiently.
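As a hedged illustration of the listwise learning-to-rank family that BioFigRank extends (this is the generic ListNet-style top-one loss, not the authors' exact model), the sketch below scores one article's figures and compares a ranking that agrees with the ground-truth importance against a reversed one.

```python
import numpy as np

def listnet_top1_loss(scores, relevance):
    """ListNet-style listwise loss for one ranked list.

    Cross-entropy between the 'top-one' probability distributions
    induced by model scores and by ground-truth relevance labels.
    """
    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()
    p_true = softmax(np.asarray(relevance, dtype=float))
    p_pred = softmax(np.asarray(scores, dtype=float))
    return float(-(p_true * np.log(p_pred)).sum())

# Figures of one article: ground-truth importance 3 > 2 > 1.
relevance = [3, 2, 1]
good = listnet_top1_loss([2.5, 1.0, 0.2], relevance)   # agrees with labels
bad = listnet_top1_loss([0.2, 1.0, 2.5], relevance)    # reversed ranking
print(f"loss(good)={good:.3f} < loss(bad)={bad:.3f}")
```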
25
Big data and clinicians: a review on the state of the science. JMIR Med Inform 2014; 2:e1. PMID: 25600256. PMCID: PMC4288113. DOI: 10.2196/medinform.2913
Abstract
Background: In the past few decades, medically related data collection has seen a huge increase, referred to as big data. These huge datasets bring challenges in storage, processing, and analysis. In clinical medicine, big data is expected to play an important role in identifying causality of patient symptoms, in predicting hazards of disease incidence or recurrence, and in improving primary-care quality.
Objective: The objective of this review was to provide an overview of the features of clinical big data, describe a few commonly employed computational algorithms, statistical methods, and software toolkits for data manipulation and analysis, and discuss the challenges and limitations in this realm.
Methods: We conducted a literature review to identify studies on big data in medicine, especially clinical medicine. We used different combinations of keywords to search PubMed, Science Direct, Web of Knowledge, and Google Scholar for literature of interest from the past 10 years.
Results: This paper reviewed studies that analyzed clinical big data and discussed issues related to storage and analysis of this type of data.
Conclusions: Big data is becoming a common feature of biological and clinical studies. Researchers who use clinical big data face multiple challenges, and the data itself has limitations. It is imperative that methodologies for data analysis keep pace with our ability to collect and store data.
26
27
FRESCO: Referential compression of highly similar sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1275-1288. PMID: 24524158. DOI: 10.1109/tcbb.2013.122
Abstract
In many applications, sets of similar texts or sequences are of high importance; prominent examples are revision histories of documents and genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology for dealing with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. In this paper, we propose a general open-source framework to compress large amounts of biological sequence data, called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios well beyond the state of the art, for instance 4,000:1 and higher for human genomes. We evaluate our algorithms on a large dataset from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
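A toy sketch of the referential idea: greedily find the longest match in the reference and store (position, length, next-character) triples. FRESCO's actual algorithm and data structures are far more sophisticated; everything here, including the naive O(n·m) matching, is purely illustrative.

```python
def referential_compress(target, reference):
    """Encode `target` as (ref_pos, match_len, next_char) triples.

    Greedy longest-match against `reference`; a toy version of
    referential compression, not FRESCO's actual algorithm.
    """
    out, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for j in range(len(reference)):
            k = 0
            while (i + k < len(target) and j + k < len(reference)
                   and target[i + k] == reference[j + k]):
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        nxt = target[i + best_len] if i + best_len < len(target) else ""
        out.append((best_pos, best_len, nxt))
        i += best_len + 1
    return out

ref = "ACGTACGTGGA"
tgt = "ACGTACGTGGT"          # one substitution at the end
print(referential_compress(tgt, ref))  # -> [(0, 10, 'T')]
```

The closer the target is to the reference, the fewer triples are needed, which is why highly similar genomes compress at ratios like 400:1 and beyond.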
28
29
KNODWAT: a scientific framework application for testing knowledge discovery methods for the biomedical domain. BMC Bioinformatics 2013; 14:191. PMID: 23763826. PMCID: PMC3691758. DOI: 10.1186/1471-2105-14-191
Abstract
Background: Professionals in the biomedical domain are confronted with an increasing mass of data. Developing knowledge discovery methods that assist professional end users in identifying, extracting, visualizing, and understanding useful information from these huge amounts of data is a major challenge. However, so many diverse methods and methodologies are available that, for biomedical researchers inexperienced in the use of even relatively popular knowledge discovery methods, it can be very difficult to select the most appropriate method for a particular research problem.
Results: A web application called KNODWAT (KNOwledge Discovery With Advanced Techniques) has been developed, using Java on the Spring Framework 3.1 and following a user-centered approach. The software runs on Java 1.6 and above and requires a web server such as Apache Tomcat and a database server such as MySQL Server. For frontend functionality and styling, Twitter Bootstrap was used, along with jQuery for interactive user interface operations.
Conclusions: The framework presented is user-centric, highly extensible, and flexible. Since it enables methods to be tested on existing data to assess suitability and performance, it is especially suitable for inexperienced biomedical researchers who are new to the field of knowledge discovery and data mining. For testing purposes, two algorithms, CART and C4.5, were implemented using the WEKA data mining framework.
30
When Medicine Meets Engineering-Paradigm Shifts in Diagnostics and Therapeutics. Diagnostics (Basel) 2013; 3:126-54. PMID: 26835672. PMCID: PMC4665584. DOI: 10.3390/diagnostics3010126
Abstract
During the last two decades, the manufacturing techniques of microfluidics-based devices have advanced phenomenally, offering unlimited potential for biomedical technologies. However, the direct application of these technologies to diagnostics and therapeutics is still far from maturity. The present challenges lie at the interfaces between engineering systems and biocomplex systems: a precisely designed engineering system with a narrow dynamic range is hard to integrate seamlessly with an adaptive biological system while still achieving the design goals. These differences remain the roadblock between two fundamentally incompatible systems. This paper does not extensively review existing microfluidic sensors and actuators; rather, we discuss the sources of the gaps hindering integration. We also introduce system interface technologies for bridging the differences, leading toward paradigm shifts in diagnostics and therapeutics.
31
Abstract
The rapid technological developments following the Human Genome Project have made personalized genomes available. As the focus now shifts from characterizing genomes to making personalized disease associations, in combination with the availability of other omics technologies, the next big push will be not only to obtain a personalized genome but to quantitatively follow other omics. This will include transcriptomes, proteomes, metabolomes, antibodyomes, and new emerging technologies, enabling the profiling of thousands of molecular components in individuals. Furthermore, omics profiling performed longitudinally can probe the temporal patterns associated with both molecular changes and the associated physiological health and disease states. Such data necessitate the development of computational methodology not only to handle and descriptively assess the data, but also to construct quantitative biological models. Here we describe the availability of personal genomes and developing omics technologies that can be brought together for personalized implementations, and how these novel integrated approaches may effectively provide a precise personalized medicine that focuses not only on characterization and treatment but ultimately on the prevention of disease.
32
33
Abstract
Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology for dealing with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. However, memory requirements of the current algorithms are high and run times are often slow. In this paper, we propose an adaptive, parallel, and highly efficient referential sequence compression method that allows fine-tuning of the trade-off between required memory and compression speed. When using 12 MB of memory, our method is on par with the best previous algorithms for human genomes in terms of compression ratio (400:1) and compression speed. In contrast, it compresses a complete human genome in just 11 seconds when provided with 9 GB of main memory, which is almost three times faster than the best competitor while using less main memory.
34
Imaging without lenses: achievements and remaining challenges of wide-field on-chip microscopy. Nat Methods 2012; 9:889-95. PMID: 22936170. DOI: 10.1038/nmeth.2114
Abstract
We discuss unique features of lens-free computational imaging tools and report some of their emerging results for wide-field on-chip microscopy, such as the achievement of a numerical aperture (NA) of ∼0.8-0.9 across a field of view (FOV) of more than 20 mm², or an NA of ∼0.1 across a FOV of ∼18 cm², which corresponds to an image with more than 1.5 gigapixels. We also discuss the current challenges that these computational on-chip microscopes face, shedding light on their future directions and applications.
35
Proceedings of the Eleventh Annual UT-ORNL-KBRIN Bioinformatics Summit 2012. BMC Bioinformatics 2012; 13 Suppl 12:A1-24. PMID: 22873757. PMCID: PMC3409059. DOI: 10.1186/1471-2105-13-s12-a1
36
Abstract
Summary: xQTL workbench is a scalable web platform for the mapping of quantitative trait loci (QTLs) at multiple levels: for example, gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL), and phenotype (phQTL) data. Popular QTL mapping methods for model organism and human populations are accessible via the web user interface. Large calculations scale easily onto multi-core computers, clusters, and the Cloud. All data involved can be uploaded and queried online: markers, genotypes, microarrays, NGS, LC-MS, GC-MS, NMR, etc. When new data types become available, xQTL workbench is quickly customized using the Molgenis software generator.
Availability: xQTL workbench runs on all common platforms, including Linux, Mac OS X, and Windows. An online demo system, installation guide, tutorials, software, and source code are available under the LGPL3 license from http://www.xqtl.org.
Contact: m.a.swertz@rug.nl
37
New and emerging analytical techniques for marine biotechnology. Curr Opin Biotechnol 2012; 23:29-33. PMID: 22265377. DOI: 10.1016/j.copbio.2011.12.007
Abstract
Marine biotechnology is the industrial, medical, or environmental application of biological resources from the sea. Since the marine environment is the most biologically and chemically diverse habitat on the planet, marine biotechnology has in recent years delivered a growing number of major therapeutic products, industrial and environmental applications, and analytical tools. These range from a snail toxin developed into a pain-control drug, to sea-squirt metabolites developed into an anti-cancer therapeutic, to marine enzymes used to remove bacterial biofilms. In addition, well-known and broadly used analytical techniques are derived from marine molecules or enzymes, including green fluorescent protein gene-tagging methods and the heat-resistant polymerases used in the polymerase chain reaction. Advances in bacterial identification, metabolic profiling, and physical handling of cells are being revolutionised by techniques such as mass spectrometric analysis of bacterial proteins. Advances in instrumentation, combined with progress in proteomics and bioinformatics, are accelerating our ability to harness biology for commercial gain. Single-cell Raman spectroscopy and microfluidics are two emerging techniques that are also discussed elsewhere in this issue. In this review, we provide a brief survey and update of the most powerful and rapidly growing analytical techniques used in marine biotechnology, together with some promising examples of less well-known, earlier-stage methods that may make a bigger impact in the future.
38
Abstract
Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss techniques for scaling computations through parallelization of calculations, after giving a quick overview of advanced programming techniques. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is to introduce poor man's parallelization by running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a "box," or virtual machine (VM). Such a VM can flexibly be deployed on multiple computers in a local network, e.g., on existing desktop PCs, and even in the Cloud, to create a "virtual" computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel by running processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, effectively creates a computing cluster and pipeline in a few steps. This allows researchers to scale up computations from their desktop, using available hardware, whenever required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200 bioinformatics and statistical software packages of interest to evolutionary biology are included, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software. Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images, for multiple targets, through one central project. BioNode can be deployed on Windows, OS X, Linux, and in the Cloud. Alongside the downloadable BioNode images, we provide online tutorials that empower bioinformaticians to install and run BioNode in different environments, as well as information for future initiatives on creating and building such images.
39
Towards big data science in the decade ahead from ten years of InCoB and the 1st ISCB-Asia Joint Conference. BMC Bioinformatics 2011; 12 Suppl 13:S1. PMID: 22372736. PMCID: PMC3278825. DOI: 10.1186/1471-2105-12-s13-s1
Abstract
The 2011 International Conference on Bioinformatics (InCoB), the annual scientific conference of the Asia-Pacific Bioinformatics Network (APBioNet), is hosted in Kuala Lumpur, Malaysia, and is co-organized with the first ISCB-Asia conference of the International Society for Computational Biology (ISCB). InCoB and the sequencing of the human genome are both celebrating their tenth anniversaries, and InCoB's goalposts for the next decade, implementing standards in bioinformatics and globally distributed computational networks, will be discussed and adopted at this conference. Of the 49 manuscripts (selected from 104 submissions) accepted to the BMC Genomics and BMC Bioinformatics conference supplements, 24 are featured in this issue, covering software tools, genome/proteome analysis, systems biology (networks, pathways, bioimaging), and drug discovery and design.
40
CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 2011; 12:356. PMID: 21878105. PMCID: PMC3228541. DOI: 10.1186/1471-2105-12-356
Abstract
Background: Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines to distribute pre-packaged, pre-configured software.
Results: We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole-genome, and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources, and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition, CloVR supports the use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms.
Conclusion: The CloVR VM and associated architecture lower the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high-throughput data processing.
41
Bioinformatics tools and database resources for systems genetics analysis in mice - a short review and an evaluation of future needs. Brief Bioinform 2011; 13:135-42. PMID: 22396485. PMCID: PMC3294237. DOI: 10.1093/bib/bbr026
Abstract
During a meeting of the SYSGENET working group ‘Bioinformatics’, currently available software tools and databases for systems genetics in mice were reviewed and the needs for future developments discussed. The group evaluated interoperability and performed initial feasibility studies. To aid future compatibility of software and exchange of already developed software modules, a strong recommendation was made by the group to integrate HAPPY and R/qtl analysis toolboxes, GeneNetwork and XGAP database platforms, and TIQS and xQTL processing platforms. R should be used as the principal computer language for QTL data analysis in all platforms and a ‘cloud’ should be used for software dissemination to the community. Furthermore, the working group recommended that all data models and software source code should be made visible in public repositories to allow a coordinated effort on the use of common data structures and file formats.
42
Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology. Nat Rev Genet 2011; 12:224. DOI: 10.1038/nrg2857-c2