1
|
Jamil HM. A Visual Interface for Querying Heterogeneous Phylogenetic Databases. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:131-144. [PMID: 26812733 DOI: 10.1109/tcbb.2016.2520943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Despite the recent growth in the number of phylogenetic databases, access to this wealth of resources remains largely driven by tools or form-based interfaces. It is our thesis that the flexibility afforded by declarative query languages may offer a better way to access these repositories, and that such a language can be used to pose truly powerful queries in unprecedented ways. In this paper, we propose a substantially enhanced closed visual query language, called PhyQL, that can be used to query phylogenetic databases represented in a canonical form. The canonical representation presented helps capture most phylogenetic tree formats in a convenient way, and is used as the storage model for our PhyloBase database, for which PhyQL serves as the query language. We have implemented a visual interface that lets end users pose PhyQL queries using visual icons and drag-and-drop operations defined over them. Once a query is posed, the interface translates the visual query into a Datalog query for execution over the canonical database. Responses are returned as hyperlinks to phylogenies that can be viewed in several formats using the tree viewers supported by PhyloBase. Results cached in the PhyQL buffer allow secondary querying on the computed results, making it a truly powerful querying architecture.
|
2
|
Jamil HM. Improving Integration Effectiveness of ID Mapping Based Biological Record Linkage. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:473-486. [PMID: 26357233 DOI: 10.1109/tcbb.2014.2355213] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]
Abstract
Traditionally, biological objects such as genes, proteins, and pathways are represented by a convenient identifier, or ID, which is then used to cross-reference, link, and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR, and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor-quality data, inadvertently leading to erroneous conclusions. In this paper, we discuss this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc and restriction-free ID mapping of arbitrary types using online resources. We also propose two declarative statements for ID conversion and for data integration based on on-the-fly ID mapping.
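To make the false-negative problem concrete, here is a minimal Python sketch of record linkage through an ID mapping table. It is an illustration only, not one of the two algorithms proposed in the paper; the function `link_records` and every identifier in it (`gene:A`, `prot:1`, and so on) are hypothetical.

```python
def link_records(left, right, id_map):
    """Join two record sets through an ID mapping table.

    Returns the linked records and the left-hand IDs that could not be linked.
    """
    linked, unmatched = [], []
    for left_id, left_row in left.items():
        right_id = id_map.get(left_id)  # None when the mapping has no entry
        if right_id in right:
            linked.append((left_id, right_id, left_row, right[right_id]))
        else:
            unmatched.append(left_id)
    return linked, unmatched


# Hypothetical records: every gene has a matching protein record.
genes = {"gene:A": {"symbol": "alpha"}, "gene:B": {"symbol": "beta"}, "gene:C": {"symbol": "gamma"}}
proteins = {"prot:1": {"length": 120}, "prot:2": {"length": 340}, "prot:3": {"length": 95}}

complete_map = {"gene:A": "prot:1", "gene:B": "prot:2", "gene:C": "prot:3"}
incomplete_map = {"gene:A": "prot:1", "gene:C": "prot:3"}  # gene:B entry is missing

linked_full, missed_full = link_records(genes, proteins, complete_map)
linked_part, missed_part = link_records(genes, proteins, incomplete_map)
```

With the incomplete mapping, the record for `gene:B` silently drops out of the integrated result even though its protein record exists, which is the kind of false negative the abstract describes.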
|
3
|
Kumar A, Grupcev V, Berrada M, Fogarty JC, Tu YC, Zhu X, Pandit SA, Xia Y. DCMS: A data analytics and management system for molecular simulation. JOURNAL OF BIG DATA 2014; 2:9. [PMID: 26069879 PMCID: PMC4456345 DOI: 10.1186/s40537-014-0009-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/30/2014] [Accepted: 11/06/2014] [Indexed: 06/04/2023]
Abstract
Molecular Simulation (MS) is a powerful tool for studying physical/chemical features of large systems and has seen applications in many scientific and engineering domains. During the simulation process, the experiments generate data on a very large number of atoms, whose spatial and temporal relationships are then observed for scientific analysis. The sheer data volumes and their intensive interactions pose significant challenges for data access, management, and analysis. To date, existing MS software systems fall short in storing and handling MS data, mainly because they lack a platform to support applications that involve intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system that our team has developed over the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) to take advantage of the declarative query interface (i.e., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is to handle the analytical queries, which are often compute-intensive. For that, we developed novel indexing and query processing strategies (including algorithms running on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload and analyze their data using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may be interesting to other users, so that the results are readily available without duplicating the analysis. We have developed a prototype of DCMS based on the PostgreSQL system, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We also used it as a platform to test other data management issues such as security and compression.
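The core idea of keeping simulation data in a relational DBMS and analyzing it declaratively can be sketched as follows. This is a hedged illustration, not the DCMS schema or its query functions: the `atom` table, its columns, and the cutoff query are assumptions, and Python's built-in SQLite module stands in for the PostgreSQL system used by the prototype.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a PostgreSQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per atom per simulation frame.
conn.execute(
    "CREATE TABLE atom ("
    " frame INTEGER, atom_id INTEGER, x REAL, y REAL, z REAL,"
    " PRIMARY KEY (frame, atom_id))"
)
rows = [
    (0, 1, 0.0, 0.0, 0.0), (0, 2, 1.0, 0.0, 0.0), (0, 3, 0.0, 3.0, 0.0),
    (1, 1, 0.0, 0.0, 0.0), (1, 2, 0.5, 0.0, 0.0), (1, 3, 0.0, 1.0, 0.0),
]
conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?)", rows)

# Declarative analysis: per frame, count atom pairs closer than a cutoff.
# Squared distances are compared so no square-root function is required.
CUTOFF = 1.5
query = """
SELECT a.frame, COUNT(*) AS close_pairs
FROM atom AS a
JOIN atom AS b ON a.frame = b.frame AND a.atom_id < b.atom_id
WHERE (a.x - b.x) * (a.x - b.x)
    + (a.y - b.y) * (a.y - b.y)
    + (a.z - b.z) * (a.z - b.z) <= ? * ?
GROUP BY a.frame
ORDER BY a.frame
"""
result = conn.execute(query, (CUTOFF, CUTOFF)).fetchall()
print(result)  # [(0, 1), (1, 3)]
```

The self-join makes the quadratic cost of pairwise analytics visible, which is why the abstract stresses specialized indexing and query processing inside the DBMS for compute-intensive queries.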
Affiliation(s)
- Anand Kumar: Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, 33620 Florida USA
- Vladimir Grupcev: Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, 33620 Florida USA
- Meryem Berrada: Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, 33620 Florida USA
- Joseph C Fogarty: Department of Physics, University of South Florida, 4202 E. Fowler Ave., PHY114, Tampa, 33620 Florida USA
- Yi-Cheng Tu: Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, 33620 Florida USA
- Xingquan Zhu: Department of Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, EE308, Boca Raton, 33431 Florida USA
- Sagar A Pandit: Department of Physics, University of South Florida, 4202 E. Fowler Ave., PHY114, Tampa, 33620 Florida USA
- Yuni Xia: Department of Computer Science, Indiana University - Purdue University Indianapolis, 723 W. Michigan St, SL280E, Indianapolis, 46202 Indiana USA
|
4
|
Jamil HM. Designing integrated computational biology pipelines visually. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:605-618. [PMID: 24091395 DOI: 10.1109/tcbb.2013.69] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/02/2023]
Abstract
The long-term cost of developing and maintaining a computational pipeline that depends upon data integration and sophisticated workflow logic is too high to even contemplate "what if" or ad hoc queries. In this paper, we introduce a novel application building interface for computational biology research, called VizBuilder, by leveraging a recent query language for life sciences databases called BioFlow. Using VizBuilder, it is now possible to develop complex ad hoc computational biology applications at throwaway cost. The underlying query language supports data integration and workflow construction almost transparently and fully automatically, using a best-effort approach. Users express their application by drawing it with VizBuilder icons and connecting them in a meaningful way. Completed applications are compiled and translated into BioFlow queries for execution by the data management system LifeDB, for which VizBuilder serves as a front end. We discuss VizBuilder features and functionalities in the context of a real-life application after we briefly introduce BioFlow. The architecture and design principles of VizBuilder are also discussed. Finally, we outline future extensions of VizBuilder. To our knowledge, VizBuilder is a unique system that allows visually designing computational biology pipelines involving distributed and heterogeneous resources in an ad hoc manner.
|
5
|
Grupcev V, Yuan Y, Tu YC, Huang J, Chen S, Pandit S, Weng M. Approximate Algorithms for Computing Spatial Distance Histograms with Accuracy Guarantees. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2012; 25:1982-1996. [PMID: 24693210 PMCID: PMC3969837 DOI: 10.1109/tkde.2012.149] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/03/2023]
Abstract
Particle simulation has become an important research tool in many scientific and engineering fields. Data generated by such simulations pose great challenges for database storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is the building block of many high-level analytics, and requires quadratic time to compute using a straightforward algorithm. Previous work has developed efficient algorithms that compute exact SDHs. While they beat the naive solution, such algorithms are still not practical for processing SDH queries against large-scale simulation data. In this paper, we take a different path to tackle this problem by focusing on approximate algorithms with provable error bounds. We first present a solution derived from the aforementioned exact SDH algorithm; this solution has a running time that is independent of the system size N. We also develop a mathematical model to analyze the mechanism that leads to errors in the basic approximate algorithm. Our model provides insights into how the algorithm can be improved to achieve higher accuracy and efficiency. Such insights give rise to a new approximate algorithm with an improved time/accuracy tradeoff. Experimental results confirm our analysis.
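For reference, the straightforward quadratic-time SDH computation mentioned above can be sketched in a few lines of Python. This is the naive baseline only, not the exact or approximate algorithms from the paper; the function name, the equal-width bucket layout, and the sample points are assumptions made for illustration.

```python
import math
from itertools import combinations


def brute_force_sdh(points, bucket_width, num_buckets):
    """Histogram of all pairwise distances; visits N*(N-1)/2 pairs."""
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        # Distances beyond the last boundary are clamped into the last bucket.
        idx = min(int(d // bucket_width), num_buckets - 1)
        hist[idx] += 1
    return hist


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
hist = brute_force_sdh(points, bucket_width=2.0, num_buckets=3)
print(hist)  # [3, 0, 3]
```

Four points yield six pairs: three fall in the bucket [0, 2) and three in the last bucket. Every pair is visited exactly once, so the cost grows quadratically with N.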
Affiliation(s)
- Vladimir Grupcev: Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL 33620.
- Yongke Yuan: Department of Industrial and Management Systems Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, FL 33620.
- Yi-Cheng Tu: Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL 33620.
- Jin Huang: Department of Computer Science, University of Texas at Arlington, 500 UTA Boulevard, Room 640, ERB Buildings, Arlington, TX 76019.
- Shaoping Chen: Department of Mathematics, Wuhan University of Technology, 122 Luosi Road, Wuhan, Hubei 430070, P.R. China.
- Sagar Pandit: Department of Physics, University of South Florida, 4202 E. Fowler Ave., PHY114, Tampa, FL 33620.
- Michael Weng: Department of Industrial and Management Systems Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, FL 33620.
|
6
|
Chen S, Tu YC, Xia Y. Performance analysis of a dual-tree algorithm for computing spatial distance histograms. THE VLDB JOURNAL : VERY LARGE DATA BASES : A PUBLICATION OF THE VLDB ENDOWMENT 2011; 20:471-494. [PMID: 21804753 PMCID: PMC3145372 DOI: 10.1007/s00778-010-0205-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
Many scientific and engineering fields produce large volumes of spatiotemporal data. The storage, retrieval, and analysis of such data pose great challenges for the design of database systems. Analysis of scientific spatiotemporal data often involves computing functions of all point-to-point interactions. One such analytic, the Spatial Distance Histogram (SDH), is of vital importance to scientific discovery. Recently, algorithms for efficient SDH processing in large-scale scientific databases have been proposed. These algorithms adopt a recursive tree-traversing strategy to process point-to-point distances in the visited tree nodes in batches, and thus require less time than the brute-force approach, in which all pairwise distances have to be computed. Despite the promising experimental results, the complexity of such algorithms has not been thoroughly studied. In this paper, we present an analysis of such algorithms based on a geometric modeling approach. The main technique is to transform the analysis of point counts into a problem of quantifying the area of regions where pairwise distances can be processed in batches by the algorithm. From the analysis, we conclude that the number of pairwise distances that are left to be processed decreases exponentially as more levels of the tree are visited. This leads to the proof of a time complexity lower than the quadratic time needed for a brute-force algorithm and builds the foundation for a constant-time approximate algorithm. Our model is also general in that it works for a wide range of point spatial distributions, histogram types, and space-partitioning options in building the tree.
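The batching idea can be illustrated with a simplified Python sketch. It is not the dual-tree algorithm analyzed in the paper: a single-level uniform grid stands in for the recursive tree, and all names and parameters (`batched_sdh`, `side`, the bucket layout) are assumptions. When the smallest and largest possible distance between two cells fall in the same bucket, all of their point pairs are counted at once without computing any distance.

```python
import math
import random
from collections import defaultdict
from itertools import combinations, combinations_with_replacement


def bucket_of(distance, width, num_buckets):
    """Bucket index for a distance; the last bucket is open-ended."""
    return min(int(distance // width), num_buckets - 1)


def brute_force_sdh(points, width, num_buckets):
    """Reference result: every pairwise distance is computed."""
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        hist[bucket_of(math.dist(p, q), width, num_buckets)] += 1
    return hist


def cell_bounds(a, b, side):
    """Smallest and largest possible distance between points of cells a and b."""
    lo_sq = hi_sq = 0.0
    for ca, cb in zip(a, b):
        gap = abs(ca - cb)
        lo_sq += (max(gap - 1, 0) * side) ** 2
        hi_sq += ((gap + 1) * side) ** 2
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


def batched_sdh(points, width, num_buckets, side):
    """SDH over a uniform grid; returns the histogram and the batched pair count."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(c // side) for c in p)].append(p)

    hist = [0] * num_buckets
    resolved = 0
    for a, b in combinations_with_replacement(sorted(cells), 2):
        pa, pb = cells[a], cells[b]
        n_pairs = len(pa) * (len(pa) - 1) // 2 if a == b else len(pa) * len(pb)
        lo, hi = cell_bounds(a, b, side)
        k = bucket_of(lo, width, num_buckets)
        if k == bucket_of(hi, width, num_buckets):
            # Every distance between the two cells lands in bucket k.
            hist[k] += n_pairs
            resolved += n_pairs
        else:
            # Cells straddle a bucket boundary: fall back to exact distances.
            pairs = combinations(pa, 2) if a == b else ((p, q) for p in pa for q in pb)
            for p, q in pairs:
                hist[bucket_of(math.dist(p, q), width, num_buckets)] += 1
    return hist, resolved


rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(300)]
hist, resolved = batched_sdh(points, width=5.0, num_buckets=3, side=1.0)
```

A tree-based version would recurse into child cells instead of falling back to exact distances, which is where the exponential decrease in unresolved pairs described in the abstract comes from.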
Affiliation(s)
- Shaoping Chen: Department of Mathematics, Wuhan University of Technology, 122 Luosi Road, 430070 Wuhan, Hubei, People’s Republic of China
- Yi-Cheng Tu: Department of Computer Science and Engineering, The University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, FL 33620, USA
- Yuni Xia: Computer and Information Science Department, Indiana University-Purdue University Indianapolis, 723 W. Michigan St., SL280, Indianapolis, IN 46202, USA
|