1
Lam V, Sharma S, Gupta S, Spouge JL, Jordan IK, Mariño-Ramírez L. Ancestry-attenuated effects of socioeconomic deprivation on type 2 diabetes disparities in the All of Us cohort. BMC Glob Public Health 2023; 1:22. [PMID: 38045036 PMCID: PMC10693462 DOI: 10.1186/s44263-023-00025-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 09/28/2023] [Indexed: 12/05/2023]
Abstract
Background Diabetes is a common disease with a major burden on morbidity, mortality, and productivity. Type 2 diabetes (T2D) accounts for roughly 90% of all diabetes cases in the USA and has a greater observed prevalence among those who identify as Black or Hispanic. Methods This study aimed to assess T2D racial and ethnic disparities using the All of Us Research Program data and to measure associations between genetic ancestry (GA), socioeconomic deprivation, and T2D. We used the All of Us Researcher Workbench to analyze T2D prevalence and model its associations with GA and with individual-level (iSDI) and zip code-based (zSDI) socioeconomic deprivation indices among participant self-identified race and ethnicity (SIRE) groups. Results The study cohort comprised 86,488 participants from the four largest SIRE groups in All of Us: Asian (n = 2311), Black (n = 16,282), Hispanic (n = 16,966), and White (n = 50,292). SIRE groups show characteristic genetic ancestry patterns, consistent with their diverse origins, together with a continuum of ancestry fractions within and between groups. The Black and Hispanic groups show the highest levels of socioeconomic deprivation, followed by the Asian and White groups. Black participants show the highest age- and sex-adjusted T2D prevalence (21.9%), followed by the Hispanic (19.9%), Asian (15.1%), and White (14.8%) groups. Minority SIRE groups and socioeconomic deprivation, both iSDI and zSDI, are positively associated with T2D when the entire cohort is analyzed together. However, SIRE and GA both show negative interaction effects with iSDI and zSDI on T2D. Higher levels of iSDI and zSDI are negatively associated with T2D in the Black and Hispanic groups, and higher levels of iSDI and zSDI are negatively associated with T2D at high levels of African and Native American ancestry. Conclusions Socioeconomic deprivation is associated with a higher prevalence of T2D in Black and Hispanic minority groups, compared to the majority White group. Nonetheless, socioeconomic deprivation is associated with reduced T2D risk within the Black and Hispanic groups. These results are paradoxical and have not been reported elsewhere, with possible explanations related to the nature of the All of Us data along with SIRE group differences in access to healthcare, diet, and lifestyle.
Affiliation(s)
- Vincent Lam
- National Institute on Minority Health and Health Disparities, National Institutes of Health, 11545 Rockville Pike, Building 11545 Rockville Pike, 2WF Room C14, Rockville, MD 20818, USA
- Shivam Sharma
- National Institute on Minority Health and Health Disparities, National Institutes of Health, 11545 Rockville Pike, Building 11545 Rockville Pike, 2WF Room C14, Rockville, MD 20818, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- Sonali Gupta
- National Institute on Minority Health and Health Disparities, National Institutes of Health, 11545 Rockville Pike, Building 11545 Rockville Pike, 2WF Room C14, Rockville, MD 20818, USA
- John L. Spouge
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- I. King Jordan
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- Leonardo Mariño-Ramírez
- National Institute on Minority Health and Health Disparities, National Institutes of Health, 11545 Rockville Pike, Building 11545 Rockville Pike, 2WF Room C14, Rockville, MD 20818, USA
2
Lam V, Sharma S, Gupta S, Spouge JL, Jordan IK, Mariño-Ramírez L. Ancestry-attenuated effects of socioeconomic deprivation on type 2 diabetes disparities in the All of Us cohort. Res Sq 2023:rs.3.rs-2976764. [PMID: 37790565 PMCID: PMC10543018 DOI: 10.21203/rs.3.rs-2976764/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
Background Diabetes is a common disease with a major burden on morbidity, mortality, and productivity. Type 2 diabetes (T2D) accounts for roughly 90% of all diabetes cases in the United States and has greater observed prevalence among those who identify as Black or Hispanic. Methods The aims of this study were to determine whether T2D racial and ethnic disparities can be observed in data from the All of Us Research Program and to measure associations of genetic ancestry (GA) and socioeconomic deprivation with T2D. The All of Us Researcher Workbench was used to calculate T2D prevalence and to model T2D associations with GA, individual-level (iSDI) and zip code-based (zSDI) socioeconomic deprivation indices within and between participant self-identified race and ethnicity (SIRE) groups. Results The study cohort comprised 86,488 participants from the four largest SIRE groups in All of Us: Asian (n=2,311), Black (n=16,282), Hispanic (n=16,966), and White (n=50,292). SIRE groups show characteristic genetic ancestry patterns, consistent with their diverse origins, together with a continuum of ancestry fractions within and between groups. The Black and Hispanic groups show the highest median SDI values, followed by the Asian and White groups. Black participants show the highest age- and sex-adjusted T2D prevalence (21.9%), followed by the Hispanic (19.9%), Asian (15.1%), and White (14.8%) groups. Minority SIRE groups and socioeconomic deprivation are positively associated with T2D when the entire cohort is analyzed together. However, SIRE and GA both show negative interaction effects with SDI on T2D. Higher levels of SDI are negatively associated with T2D in the Black and Hispanic groups, and higher levels of SDI are negatively associated with T2D at high levels of African and Native American ancestry. Conclusion Socioeconomic deprivation is positively associated with the SIRE group T2D disparities observed here but negatively associated with T2D within the Black and Hispanic groups that show the highest T2D prevalence. These results are paradoxical and have not been reported elsewhere. We discuss possible explanations for this paradox related to the nature of the All of Us data along with SIRE group differences in access to healthcare, diet, and lifestyle.
3
Stanke Z, Spouge JL. Estimating age-stratified transmission and reproduction numbers during the early exponential phase of an epidemic: A case study with COVID-19 data. Epidemics 2023; 44:100714. [PMID: 37595401 PMCID: PMC10528737 DOI: 10.1016/j.epidem.2023.100714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2023] [Revised: 06/07/2023] [Accepted: 08/08/2023] [Indexed: 08/20/2023] Open
Abstract
In a pending pandemic, early knowledge of age-specific disease parameters, e.g., susceptibility, infectivity, and the clinical fraction (the fraction of infections coming to clinical attention), supports targeted public health responses like school closures or sequestration of the elderly. The earlier the knowledge, the more useful it is, so the present article examines an early phase of many epidemics, exponential growth. Using age-stratified COVID-19 case counts collected in Canada, China, Israel, Italy, the Netherlands, and the United Kingdom before April 23, 2020, we present a linear analysis of the exponential phase that attempts to estimate the age-specific disease parameters given above. Some combinations of the parameters can be estimated by requiring that they change smoothly with age. The estimation yielded: (1) the case susceptibility, defined for each age-group as the product of susceptibility to infection and the clinical fraction; (2) the mean number of transmissions of infection per contact within each age-group; and (3) the reproduction number of infection within each age-group, i.e., the diagonal of the age-stratified next-generation matrix. Our restriction to data from the exponential phase indicates the combinations of epidemic parameters that are intrinsically easiest to estimate with early age-stratified case counts. For example, conclusions concerning the age-dependence of case susceptibility appeared more robust than corresponding conclusions about infectivity. Generally, the analysis produced some results consistent with conclusions confirmed much later in the COVID-19 pandemic. Notably, our analysis showed that in some countries, the reproduction number of infection within the half-decade 70-75 was unusually large compared to other half-decades. Our analysis therefore could have anticipated that without countermeasures, COVID-19 would spread rapidly once seeded in homes for the elderly.
Affiliation(s)
- Zachary Stanke
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
4
Frith MC, Shaw J, Spouge JL. How to optimally sample a sequence for rapid analysis. Bioinformatics 2023; 39:7005197. [PMID: 36702468 PMCID: PMC9907223 DOI: 10.1093/bioinformatics/btad057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 01/24/2023] [Indexed: 01/28/2023] Open
Abstract
MOTIVATION We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin C Frith
- Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan
- Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo 169-8555, Japan
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
- John L Spouge
- National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
5
Spouge JL. A closed formula relevant to 'Theory of local k-mer selection with applications to long-read alignment' by Jim Shaw and Yun William Yu. Bioinformatics 2022; 38:4848-4849. [PMID: 36063041 PMCID: PMC9801975 DOI: 10.1093/bioinformatics/btac604] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/08/2021] [Revised: 07/11/2022] [Accepted: 09/01/2022] [Indexed: 01/05/2023] Open
6
Spouge JL. A comprehensive estimation of country-level basic reproduction numbers R0 for COVID-19: Regime regression can automatically estimate the end of the exponential phase in epidemic data. PLoS One 2021; 16:e0254145. [PMID: 34255772 PMCID: PMC8277067 DOI: 10.1371/journal.pone.0254145] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/23/2021] [Accepted: 06/18/2021] [Indexed: 12/30/2022] Open
Abstract
In a compartmental epidemic model, the initial exponential phase reflects a fixed interaction between an infectious agent and a susceptible population in steady state, so it determines the basic reproduction number R0 on its own. After the exponential phase, dynamic complexities like societal responses muddy the practical interpretation of many estimated parameters. The computer program ARRP, already available from sequence alignment applications, automatically estimated the end of the exponential phase in COVID-19 and extracted the exponential growth rate r for 160 countries. By positing a gamma-distributed generation time, the exponential growth method then yielded R0 estimates for COVID-19 in 160 countries. The use of ARRP ensured that the R0 estimates were largely freed from any dependency outside the exponential phase. The Prem matrices quantify rates of effective contact for infectious disease. Without using any age-stratified COVID-19 data, but under strong assumptions about the homogeneity of susceptibility, infectiousness, etc., across different age-groups, the Prem contact matrices also yielded theoretical R0 estimates for COVID-19 in 152 countries, generally in quantitative conflict with the R0 estimates derived from the exponential growth method. An exploratory analysis manipulating only the Prem contact matrices reduced the conflict, suggesting that age-groups under 20 years did not promote the initial exponential growth of COVID-19 as much as other age-groups. The analysis therefore supports tentatively and tardily, but independently of age-stratified COVID-19 data, the low priority given to vaccinating younger age groups. It also supports the judicious reopening of schools. The exploratory analysis also supports the possibility of suspecting differences in epidemic spread among different age-groups, even before substantial amounts of age-stratified data become available.
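The exponential growth method named in this abstract can be sketched as follows. The sketch assumes the standard relation R0 = 1/M(-r), where M is the moment generating function of the generation time; for a gamma distribution with shape k and scale θ this reduces to R0 = (1 + rθ)^k. The function name and the numerical inputs are illustrative assumptions: the abstract does not state the generation-time parameters, and this is not the ARRP program.

```python
def r0_from_growth_rate(r: float, mean_gt: float, sd_gt: float) -> float:
    """Convert an exponential growth rate r (per day) into R0.

    Assumes a gamma-distributed generation time with the given mean and
    standard deviation (in days). Hypothetical helper, not from the article.
    """
    if mean_gt <= 0 or sd_gt <= 0:
        raise ValueError("mean_gt and sd_gt must be positive")
    shape = (mean_gt / sd_gt) ** 2   # gamma shape k = mean^2 / variance
    scale = sd_gt ** 2 / mean_gt     # gamma scale theta = variance / mean
    if 1.0 + r * scale <= 0:
        raise ValueError("growth rate too negative for this generation time")
    # R0 = 1 / M(-r), with M(-r) = (1 + r * theta) ** (-k) for a gamma law.
    return (1.0 + r * scale) ** shape


if __name__ == "__main__":
    # Illustrative values only: growth rate 0.2/day, generation time 5 +/- 5 days.
    print(r0_from_growth_rate(0.2, mean_gt=5.0, sd_gt=5.0))
```

With equal mean and standard deviation the gamma law is exponential, and the relation becomes the familiar R0 = 1 + r × (mean generation time).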
Affiliation(s)
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
7
Martín MP, Daniëls PP, Erickson D, Spouge JL. Correction: Figures of merit and statistics for detecting faulty species identification with DNA barcodes: A case study in Ramaria and related fungal genera. PLoS One 2021; 16:e0250030. [PMID: 33826666 PMCID: PMC8026041 DOI: 10.1371/journal.pone.0250030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
8
Melzak KA, Spouge JL, Boecker C, Kirschhöfer F, Brenner-Weiss G, Bieback K. Hemolysis Pathways during Storage of Erythrocytes and Inter-Donor Variability in Erythrocyte Morphology. Transfus Med Hemother 2021; 48:39-47. [PMID: 33708051 DOI: 10.1159/000508711] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/18/2020] [Accepted: 05/03/2020] [Indexed: 01/10/2023] Open
Abstract
Background Red blood cells (RBCs) stored for transfusions can lyse over the course of the storage period. The lysis is traditionally assumed to occur via the formation of spiculated echinocyte forms, so that cells that appear smoother are assumed to have better storage quality. We investigate this hypothesis by comparing the morphological distribution to the hemolysis for samples from different donors. Methods Red cell concentrates were obtained from a regional blood bank quality control laboratory. Out of 636 units processed by the laboratory, we obtained 26 high hemolysis units and 24 low hemolysis units for assessment of RBC morphology. The association between the morphology and the hemolysis was tested with the Wilcoxon-Mann-Whitney U test. Results Samples with high stomatocyte counts (p = 0.0012) were associated with increased hemolysis, implying that cells can lyse via the formation of stomatocytes. Conclusion RBCs can lyse without significant echinocyte formation. Lower degrees of spiculation are not a good indicator of low hemolysis when RBCs from different donors are compared.
Affiliation(s)
- Kathryn A Melzak
- Institute of Functional Interfaces, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
- John L Spouge
- National Center for Biotechnology Information, National Institutes of Health USA, Bethesda, Maryland, USA
- Clemens Boecker
- Institute of Functional Interfaces, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
- Frank Kirschhöfer
- Institute of Functional Interfaces, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
- Gerald Brenner-Weiss
- Institute of Functional Interfaces, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
- Karen Bieback
- Institute for Transfusion Medicine and Immunology, Flowcore Mannheim, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
9
Spouge JL, Ziegelbauer JM, Gonzalez M. A linear-time algorithm that avoids inverses and computes Jackknife (leave-one-out) products like convolutions or other operators in commutative semigroups. Algorithms Mol Biol 2020; 15:17. [PMID: 32968428 PMCID: PMC7502207 DOI: 10.1186/s13015-020-00178-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2020] [Accepted: 09/08/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Data about herpesvirus microRNA motifs on human circular RNAs suggested the following statistical question. Consider independent random counts, not necessarily identically distributed. Conditioned on the sum, decide whether one of the counts is unusually large. Exact computation of the p-value leads to a specific algorithmic problem. Given $n$ elements $g_0, g_1, \ldots, g_{n-1}$ in a set $G$ with the closure and associative properties and a commutative product without inverses, compute the jackknife (leave-one-out) products $\bar{g}_j = g_0 g_1 \cdots g_{j-1} g_{j+1} \cdots g_{n-1}$ ($0 \le j < n$). RESULTS This article gives a linear-time Jackknife Product algorithm. Its upward phase constructs a standard segment tree for computing segment products like $g_{i,j} = g_i g_{i+1} \cdots g_{j-1}$; its novel downward phase mirrors the upward phase while exploiting the symmetry of $g_j$ and its complement $\bar{g}_j$. The algorithm requires storage for $2n$ elements of $G$ and only about $3n$ products. In contrast, the standard segment tree algorithms require about $n$ products for construction and $\log_2 n$ products for calculating each $\bar{g}_j$, i.e., about $n \log_2 n$ products in total; and a naïve quadratic algorithm using $n-2$ element-by-element products to compute each $\bar{g}_j$ requires $n(n-2)$ products. CONCLUSIONS In the herpesvirus application, the Jackknife Product algorithm required 15 min; standard segment tree algorithms would have taken an estimated 3 h; and the quadratic algorithm, an estimated 1 month. The Jackknife Product algorithm has many possible uses in bioinformatics and statistics.
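The two-phase idea described in this abstract can be illustrated with a minimal sketch. It is one possible reading of the upward and downward phases, not the authors' implementation: the array layout, the function name `jackknife_products`, and the use of `None` for the empty product are assumptions made here, and the sketch stores more than the 2n elements quoted in the abstract. It relies only on an associative, commutative operation and never uses inverses.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def jackknife_products(items: List[T], op: Callable[[T, T], T]) -> List[Optional[T]]:
    """Leave-one-out products of items under a commutative, associative op.

    Entry j is the product of every item except items[j]. For a single item
    the leave-one-out product is empty, reported here as None.
    """
    n = len(items)
    if n == 0:
        return []
    # Upward phase: implicit binary tree, leaves at n..2n-1, root at 1.
    tree: List[Optional[T]] = [None] * (2 * n)
    tree[n:] = items
    for i in range(n - 1, 0, -1):
        tree[i] = op(tree[2 * i], tree[2 * i + 1])
    # Downward phase: comp[i] is the product of all leaves NOT below node i.
    # The root excludes nothing, so comp[1] stays None (empty product).
    comp: List[Optional[T]] = [None] * (2 * n)
    for i in range(2, 2 * n):
        sibling = tree[i ^ 1]        # product of the leaves below the sibling
        parent_comp = comp[i >> 1]   # product of the leaves outside the parent
        comp[i] = sibling if parent_comp is None else op(parent_comp, sibling)
    return comp[n:]


if __name__ == "__main__":
    # max has no inverse, so dividing out one element is not an option.
    print(jackknife_products([4, 1, 9, 3, 7], max))  # [9, 9, 7, 9, 9]
```

The upward loop performs n - 1 products and the downward loop at most 2n - 2, which is in line with the "about 3n products" quoted above. Because leaves are regrouped freely, the sketch is only valid when the operation is commutative.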
10
Martín MP, Daniëls PP, Erickson D, Spouge JL. Figures of merit and statistics for detecting faulty species identification with DNA barcodes: A case study in Ramaria and related fungal genera. PLoS One 2020; 15:e0237507. [PMID: 32813726 PMCID: PMC7437900 DOI: 10.1371/journal.pone.0237507] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/05/2019] [Accepted: 07/28/2020] [Indexed: 11/19/2022] Open
Abstract
DNA barcoding can identify biological species and provides an important tool in diverse applications, such as conserving species and identifying pathogens, among many others. If combined with statistical tests, DNA barcoding can focus taxonomic scrutiny onto anomalous species identifications based on morphological features. Accordingly, we put nonparametric tests into a taxonomic context to answer questions about our sequence dataset of the formal fungal barcode, the nuclear ribosomal internal transcribed spacer (ITS). For example, does DNA barcoding concur with annotated species identifications significantly better if expert taxonomists produced the annotations? Does species assignment improve significantly if sequences are restricted to lengths greater than 500 bp? Both questions require a figure of merit to measure the accuracy of species identification, typically provided by the probability of correct identification (PCI). Many articles on DNA barcoding use variants of PCI to measure the accuracy of species identification, but do not provide the variants with names, and the absence of explicit names hinders the recognition that the different variants are not comparable from study to study. We provide four variant PCIs with a name and show that for fixed data they follow systematic inequalities. Despite custom, therefore, their comparison is at a minimum problematic. Some popular PCI variants are particularly vulnerable to errors in species annotation, insensitive to improvements in a barcoding pipeline, and unable to predict identification accuracy as a database grows, making them unsuitable for many purposes. Generally, the Fractional PCI has the best properties as a figure of merit for species identification. The fungal genus Ramaria provides unusual taxonomic difficulties. As a case study, it shows that a good taxonomic background can be combined with the pertinent summary statistics of molecular results to improve the identification of doubtful samples, linking both disciplines synergistically.
Affiliation(s)
- María P. Martín
- Department of Mycology, Real Jardín Botánico-CSIC, Madrid, Spain
- Pablo P. Daniëls
- Department of Botany, Ecology and Plant Physiology, Campus Rabanales, University of Córdoba, Córdoba, Spain
- David Erickson
- Joint Institute of Food Safety and Applied Nutrition, University of Maryland, College Park, Maryland, United States of America
- John L. Spouge
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, United States of America
11
Patel V, Spouge JL. Estimating the basic reproduction number of a pathogen in a single host when only a single founder successfully infects. PLoS One 2020; 15:e0227127. [PMID: 31923263 PMCID: PMC6953795 DOI: 10.1371/journal.pone.0227127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2019] [Accepted: 12/12/2019] [Indexed: 11/27/2022] Open
Abstract
If viruses or other pathogens infect a single host, the outcome of infection may depend on the initial basic reproduction number R0, the expected number of host cells infected by a single infected cell. This article shows that sometimes, phylogenetic models can estimate the initial R0, using only sequences sampled from the pathogenic population during its exponential growth or shortly thereafter. When evaluated by simulations mimicking the bursting viral reproduction of HIV and simultaneous sampling of HIV gp120 sequences during early viremia, the estimated R0 displayed useful accuracies in achievable experimental designs. Estimates of R0 have several potential applications to investigators interested in the progress of infection in single hosts, including: (1) timing a pathogen’s movement through different microenvironments; (2) timing the change points in a pathogen’s mode of spread (e.g., timing the change from cell-free spread to cell-to-cell spread, or vice versa, in an HIV infection); (3) quantifying the impact different initial microenvironments have on pathogens (e.g., in mucosal challenge with HIV, quantifying the impact that the presence or absence of mucosal infection has on R0); (4) quantifying subtle changes in infectability in therapeutic trials (either human or animal), even when therapies do not produce total sterilizing immunity; and (5) providing a variable predictive of the clinical efficacy of prophylactic therapies.
Affiliation(s)
- Vruj Patel
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- John L. Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
12
Carroll HD, Spouge JL, Gonzalez M. MultiDomainBenchmark: a multi-domain query and subject database suite. BMC Bioinformatics 2019; 20:77. [PMID: 30764761 PMCID: PMC6376684 DOI: 10.1186/s12859-019-2660-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/14/2018] [Accepted: 01/28/2019] [Indexed: 11/10/2022] Open
Abstract
Background Genetic sequence database retrieval benchmarks play an essential role in evaluating the performance of sequence searching tools. To date, all phylogenetically diverse benchmarks known to the authors include only query sequences with single protein domains. Domains are the primary building blocks of protein structure and function. Independently, each domain can fulfill a single function, but most proteins (>80% in Metazoa) exist as multi-domain proteins. Multiple domain units combine in various arrangements or architectures to create different functions and are often under evolutionary pressures to yield new ones. Thus, it is crucial to create gold standards reflecting the multi-domain complexity of real proteins to more accurately evaluate sequence searching tools. Description This work introduces MultiDomainBenchmark (MDB), a database suite of 412 curated multi-domain queries and 227,512 target sequences, representing at least 5108 species and 1123 phylogenetically divergent protein families, their relevancy annotation, and domain location. Here, we use the benchmark to evaluate the performance of two commonly used sequence searching tools, BLAST/PSI-BLAST and HMMER. Additionally, we introduce a novel classification technique for multi-domain proteins to evaluate how well an algorithm recovers a domain architecture. Conclusion MDB is publicly available at http://csc.columbusstate.edu/carroll/MDB/. Electronic supplementary material The online version of this article (10.1186/s12859-019-2660-5) contains supplementary material, which is available to authorized users.
13
Abstract
An RNA switch triggers biological functions by toggling between two conformations. RNA switches include bacterial riboswitches, where ligand binding can stabilize a bound structure. For RNAs with only one stable structure, structural prediction usually just requires a straightforward free energy minimization, but for an RNA switch, the prediction of a less stable alternative structure is often computationally costly and even problematic. The current sampling-clustering method predicts stable and alternative structures by partitioning structures sampled from the energy landscape into two clusters, but it is very time-consuming. Instead, we predict the alternative structure of an RNA switch from conditional probability calculations within the energy landscape. First, our method excludes base pairs related to the most stable structure in the energy landscape. Then, it detects stable stems (“seeds”) in the remaining landscape. Finally, it folds an alternative structure prediction around a seed. While having comparable riboswitch classification performance, the conditional-probability computations had fewer adjustable parameters, offered greater predictive flexibility, and were more than one thousand times faster than the sampling step alone in sampling-clustering predictions, the competing standard. Overall, the described approach helps traverse thermodynamically improbable energy landscapes to find biologically significant substructures and structures rapidly and effectively.
Affiliation(s)
- Amirhossein Manzourolajdad
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- John L. Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
14
Spouge JL. An accurate approximation for the expected site frequency spectrum in a Galton-Watson process under an infinite sites mutation model. Theor Popul Biol 2019; 127:7-15. [PMID: 30876864 DOI: 10.1016/j.tpb.2019.03.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/24/2018] [Revised: 03/04/2019] [Accepted: 03/05/2019] [Indexed: 01/26/2023]
Abstract
If viruses or other pathogens infect a single host, the outcome of infection often hinges on the fate of the initial invaders. The initial basic reproduction number R0, the expected number of cells infected by a single infected cell, helps determine whether the initial viruses can establish a successful beachhead. To determine R0, the Kingman coalescent or continuous-time birth-and-death process can be used to infer the rate of exponential growth in an historical population. Given M sequences sampled in the present, the two models can make the inference from the site frequency spectrum (SFS), the count of mutations that appear in exactly k sequences (k=1,2,…,M). In the case of viruses, however, if R0 is large and an infected cell bursts while propagating virus, the two models are suspect, because they are Markovian with only binary branching. Accordingly, this article develops an approximation for the SFS of a discrete-time branching process with synchronous generations (i.e., a Galton-Watson process). When evaluated in simulations with an asynchronous, non-Markovian model (a Bellman-Harris process) with parameters intended to mimic the bursting viral reproduction of HIV, the approximation proved superior to approximations derived from the Kingman coalescent or continuous-time birth-and-death process. This article demonstrates that in analogy to methods in human genetics, the SFS of viral sequences sampled well after latent infection can remain informative about the initial R0. Thus, it suggests the utility of analyzing the SFS of sequences derived from patient and animal trials of viral therapies, because in some cases, the initial R0 may be able to indicate subtle therapeutic progress, even in the absence of statistically significant differences in the infection of treatment and control groups.
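To make the definition of the site frequency spectrum in this abstract concrete, the sketch below tabulates an observed SFS from M sampled sequences. It assumes the sequences have already been reduced to a 0/1 matrix (1 meaning the site carries the mutation), which fits an infinite sites model where each site mutates at most once. The function name and the toy data are illustrative assumptions, and the sketch computes the observed SFS only, not the article's approximation for its expectation.

```python
from collections import Counter
from typing import List, Sequence


def site_frequency_spectrum(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Return [s_1, ..., s_M], where s_k counts the sites whose mutation
    appears in exactly k of the M sampled sequences.

    haplotypes[i][s] is 1 if sequence i carries the mutation at site s, else 0.
    """
    m = len(haplotypes)
    if m == 0:
        return []
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all sequences must cover the same sites")
    # Number of sequences carrying the mutation, tallied per site.
    carriers = Counter(sum(h[s] for h in haplotypes) for s in range(n_sites))
    # Sites with zero carriers are not segregating and are left out.
    return [carriers.get(k, 0) for k in range(1, m + 1)]


if __name__ == "__main__":
    toy = [
        [1, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
    ]
    print(site_frequency_spectrum(toy))  # [1, 2, 0]
```

In the toy data, one mutation is private to a single sequence and two are shared by exactly two sequences, so the spectrum is [1, 2, 0].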
Affiliation(s)
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
|
15
|
Tang K, Ren J, Cronn R, Erickson DL, Milligan BG, Parker-Forney M, Spouge JL, Sun F. Alignment-free genome comparison enables accurate geographic sourcing of white oak DNA. BMC Genomics 2018; 19:896. [PMID: 30526482 PMCID: PMC6288960 DOI: 10.1186/s12864-018-5253-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/24/2018] [Accepted: 11/15/2018] [Indexed: 01/14/2023]
Abstract
Background The application of genomic data and bioinformatics for the identification of restricted or illegally-sourced natural products is urgently needed. The taxonomic identity and geographic provenance of raw and processed materials have implications in sustainable-use commercial practices, and relevance to the enforcement of laws that regulate or restrict illegally harvested materials, such as timber. Improvements in genomics make it possible to capture and sequence partial-to-complete genomes from challenging tissues, such as wood and wood products. Results In this paper, we report the success of an alignment-free genome comparison method, $d_2^{\ast}$, that differentiates different geographic sources of white oak (Quercus) species with a high level of accuracy with very small amount of genomic data. The method is robust to sequencing errors, different sequencing laboratories and sequencing platforms. Conclusions This method offers an approach based on genome-scale data, rather than panels of pre-selected markers for specific taxa. The method provides a generalizable platform for the identification and sourcing of materials using a unified next generation sequencing and analysis framework. Electronic supplementary material The online version of this article (10.1186/s12864-018-5253-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kujin Tang
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Jie Ren
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Richard Cronn
- Pacific Northwest Research Station, USDA Forest Service, Corvallis, OR, 97331, USA.
- David L Erickson
- DNA4 Technologies LLC, bwtech@UMBC Research & Technology Park, Baltimore, MD, 21227, USA
- Brook G Milligan
- Conservation Genomics Laboratory, Department of Biology, New Mexico State University, Las Cruces, NM, 88003, USA
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Fengzhu Sun
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China.
|
16
|
Gauran IIM, Park J, Lim J, Park D, Zylstra J, Peterson T, Kann M, Spouge JL. Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data. Biometrics 2017; 74:458-471. [PMID: 28940296 DOI: 10.1111/biom.12779] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 10/01/2016] [Revised: 08/01/2017] [Accepted: 08/01/2017] [Indexed: 11/28/2022]
Abstract
In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches since the latter have limitations in considering the functional context that the position of the mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model in order to account for the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assumed that there exists a cut-off value such that smaller counts than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on screening process so that the number of mutations exceeding a certain value should be considered as significant mutations. Simulated and protein domain data sets are used to illustrate this procedure in estimation of the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
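As a rough illustration of the zero-inflated count model named in this abstract, the sketch below evaluates a Zero-inflated Generalized Poisson probability mass function. It assumes Consul's parameterization of the generalized Poisson (rate `theta`, dispersion `lam`) and a mixing weight `phi` for the excess zeros; the article may use a different parameterization, and all names here are hypothetical.

```python
import math

def generalized_poisson_logpmf(y, theta, lam):
    """log P(Y = y) under Consul's generalized Poisson distribution.

    Assumes theta > 0 and 0 <= lam < 1; lam = 0 recovers the ordinary Poisson.
    Working on the log scale avoids overflow for large counts.
    """
    return (
        math.log(theta)
        + (y - 1) * math.log(theta + lam * y)
        - (theta + lam * y)
        - math.lgamma(y + 1)
    )

def zigp_pmf(y, phi, theta, lam):
    """P(Y = y) for a zero-inflated generalized Poisson.

    With probability phi the count is an excess zero; otherwise it is
    drawn from the generalized Poisson, which can also produce a zero.
    """
    base = math.exp(generalized_poisson_logpmf(y, theta, lam))
    if y == 0:
        return phi + (1.0 - phi) * base
    return (1.0 - phi) * base

# Illustrative parameter values only; they are not estimates from the article.
print([round(zigp_pmf(y, phi=0.3, theta=2.0, lam=0.3), 4) for y in range(5)])
```

Splitting the zero probability into `phi` and the model's own zero term is what lets such a model separate excess zeros from zeros generated by the count distribution itself.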
Affiliation(s)
- Iris Ivy M Gauran
- Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.; School of Statistics, University of the Philippines Diliman, Quezon City, 1101, Philippines
- Junyong Park
- Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A
- Johan Lim
- Department of Statistics, Seoul National University, Seoul, 08826, Republic of Korea
- DoHwan Park
- Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A
- John Zylstra
- Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A
- Thomas Peterson
- Department of Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A
- Maricel Kann
- Department of Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, U.S.A
|
17
|
Gonzalez M, DeVico AL, Spouge JL. Conserved signatures indicate HIV-1 transmission is under strong selection and thus is not a "stochastic" process. Retrovirology 2017; 14:13. [PMID: 28231858 PMCID: PMC5324211 DOI: 10.1186/s12977-016-0326-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/17/2016] [Accepted: 12/22/2016] [Indexed: 11/23/2022]
Abstract
Recently, Oberle et al. published a paper in Retrovirology evaluating the question of whether selection plays a role in HIV transmission. The Oberle study found no obvious genotypic or phenotypic differences between donors and recipients of epidemiologically linked pairs from the Swiss cohort. Thus, Oberle et al. characterized HIV-1 B transmission as largely “stochastic”, an imprecise and potentially misleading term. Here, we re-analyzed their data and placed them in the context of transmission data for over 20 other human and animal trials. The present study finds that the transmitted/founder (T/F) viruses from the Swiss cohort show the same non-random genetic signatures conserved in 118 HIV-1, 40 SHIV, and 12 SIV T/F viruses previously published by two independent groups. We provide alternative interpretations of the Swiss cohort data and conclude that the sequences of their donor viruses lacked variability at the specific sites where other studies were able to demonstrate genotypic selection. Oberle et al. observed no phenotypic selection in vitro, so the problem of determining the in vivo phenotypic mechanisms that cause genotypic selection in HIV remains open.
Affiliation(s)
- Mileidy Gonzalez
- Statistical Computational Biology Group, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
- Anthony L DeVico
- Division of Basic Science and Vaccine Research, Institute of Human Virology, University of Maryland School of Medicine, Baltimore, MD, USA
- John L Spouge
- Statistical Computational Biology Group, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
|
18
|
Acevedo-Luna N, Mariño-Ramírez L, Halbert A, Hansen U, Landsman D, Spouge JL. Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules. BMC Bioinformatics 2016; 17:479. [PMID: 27871221 PMCID: PMC5117513 DOI: 10.1186/s12859-016-1354-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Accepted: 11/11/2016] [Indexed: 11/24/2022]
Abstract
Background Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA, to control specific sets of genes. Some transcription factor binding sites (TFBSs) near the transcription start site (TSS) display tight positional preferences relative to the TSS. Furthermore, near the TSS, RMs can co-localize TFBSs with each other and the TSS. The proportion of TFBS positional preferences due to TFBS co-localization within RMs is unknown, however. ChIP experiments confirm co-localization of some TFBSs genome-wide, including near the TSS, but they typically examine only a few TFs at a time, using non-physiological conditions that can vary from lab to lab. In contrast, sequence analysis can examine many TFs uniformly and methodically, broadly surveying the co-localization of TFBSs with tight positional preferences relative to the TSS. Results Our statistics found 43 significant sets of human motifs in the JASPAR TF Database with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a gene group of 135 to 3304 genes, with 42/43 (98%) gene groups independently validated by DAVID, a gene ontology database, with FDR < 0.05. Motifs corresponding to two TFBSs in a RM should co-occur more than by chance alone, enriching the intersection of the gene groups corresponding to the two TFs. Thus, a gene-group intersection systematically enriched beyond chance alone provides evidence that the two TFs participate in an RM. Of the 903 = 43*42/2 intersections of the 43 significant gene groups, we found 768/903 (85%) pairs of gene groups with significantly enriched intersections, with 564/768 (73%) intersections independently validated by DAVID with FDR < 0.05. A user-friendly web site at http://go.usa.gov/3kjsH permits biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs. 
Conclusions Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM near the TSS that binds a particular TF subunit. Of all intersections of our 43 significant gene groups, 85% were significantly enriched, with 73% of the significant enrichments independently validated by gene ontology. The co-localization of TFBSs within RMs therefore likely explains much of the tight TFBS positional preferences near the TSS. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1354-5) contains supplementary material, which is available to authorized users.
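The pair count 903 = 43*42/2 and the idea of an intersection "enriched beyond chance alone" can be sketched as follows. The abstract does not name the enrichment test, so the hypergeometric upper tail used here is an assumption (one common choice), and the universe size and overlap in the example are hypothetical values, not data from the article.

```python
from math import comb

def n_pairs(n_groups):
    """Number of unordered pairs among n_groups gene groups."""
    return n_groups * (n_groups - 1) // 2

def hypergeom_upper_tail(universe, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn without replacement
    from `universe` genes, of which size_a belong to gene group A.

    Assumed test: the article does not specify how enrichment was scored.
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total

print(n_pairs(43))  # 903, the number of gene-group intersections examined

# Hypothetical example: a 20,000-gene universe, groups of 135 and 3304 genes
# (the extreme group sizes quoted in the abstract), sharing 40 genes.
print(hypergeom_upper_tail(20_000, 135, 3304, 40))
```

Exact integer binomials keep the tail probability free of rounding error until the final division, at the cost of some speed on large universes.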
Affiliation(s)
- Natalia Acevedo-Luna
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA, 50011, USA
- Leonardo Mariño-Ramírez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Armand Halbert
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Ulla Hansen
- Department of Biology, Boston University, 5 Cummington Mall, Boston, MA, 02215, USA
- David Landsman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
|
19
|
Manzourolajdad A, Gonzalez M, Spouge JL. Changes in the Plasticity of HIV-1 Nef RNA during the Evolution of the North American Epidemic. PLoS One 2016; 11:e0163688. [PMID: 27685447 PMCID: PMC5042412 DOI: 10.1371/journal.pone.0163688] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/21/2016] [Accepted: 09/13/2016] [Indexed: 02/04/2023]
Abstract
Because of a high mutation rate, HIV exists as a viral swarm of many sequence variants evolving under various selective pressures from the human immune system. Although the Nef gene codes for the most immunogenic of HIV accessory proteins, which alone makes it of great interest to HIV research, it also encodes an RNA structure, whose contribution to HIV virulence has been largely unexplored. Nef RNA helps HIV escape RNA interference (RNAi) through nucleotide changes and alternative folding. This study examines Historic and Modern Datasets of patient HIV-1 Nef sequences during the evolution of the North American epidemic for local changes in RNA plasticity. By definition, RNA plasticity refers to an RNA molecule’s ability to take alternative folds (i.e., alternative conformations). Our most important finding is that an evolutionarily conserved region of the HIV-1 Nef gene, which we denote by R2, recently underwent a statistically significant increase in its RNA plasticity. Thus, our results indicate that Modern Nef R2 typically accommodates an alternative fold more readily than Historic Nef R2. Moreover, the increase in RNA plasticity resides mostly in synonymous nucleotide changes, which cannot be a response to selective pressures on the Nef protein. R2 may therefore be of interest in the development of antiviral RNAi therapies.
Affiliation(s)
- Amirhossein Manzourolajdad
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Mileidy Gonzalez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- John L. Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
|
20
|
Abstract
Consider a renewal process. The renewal events partition the process into i.i.d. renewal cycles. Assume that on each cycle, a rare event called ‘success’ can occur. Such successes lend themselves naturally to approximation by Poisson point processes. If each success occurs after a random delay, however, Poisson convergence may be relatively slow, because each success corresponds to a time interval, not a point. In 1996, Altschul and Gish proposed a finite-size correction to a particular approximation by a Poisson point process. Their correction is now used routinely (about once a second) when computers compare biological sequences, although it lacks a mathematical foundation. This paper generalizes their correction. For a single renewal process or several renewal processes operating in parallel, this paper gives an asymptotic expansion that contains in successive terms a Poisson point approximation, a generalization of the Altschul-Gish correction, and a correction term beyond that.
|
21
|
Abstract
In bioinformatics, the notion of an ‘island’ enhances the efficient simulation of gapped local alignment statistics. This paper generalizes several results relevant to gapless local alignment statistics from one to higher dimensions, with a particular eye to applications in gapped alignment statistics. For example, reversal of paths (rather than of discrete time) generalizes a distributional equality, from queueing theory, between the Lindley (local sum) and maximum processes. Systematic investigation of an ‘ownership’ relationship among vertices in ℤ² formalizes the notion of an island as a set of vertices having a common owner. Predictably, islands possess some stochastic ordering and spatial averaging properties. Moreover, however, the average number of vertices in a subcritical stationary island is 1, generalizing a theorem of Kac about stationary point processes. The generalization leads to alternative ways of simulating some island statistics.
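The one-dimensional equality between the Lindley (local sum) process and the maximum process, which this abstract generalizes, can be checked numerically. The sketch below covers only the classical one-dimensional case under time reversal; it does not attempt the paper's path-reversal generalization, and the function names are illustrative.

```python
def lindley_final(increments):
    """Final value of the Lindley recursion W_n = max(0, W_{n-1} + X_n), W_0 = 0."""
    w = 0
    for x in increments:
        w = max(0, w + x)
    return w

def max_partial_sum(increments):
    """Maximum of the partial sums S_0 = 0, S_1, ..., S_n."""
    best = total = 0
    for x in increments:
        total += x
        best = max(best, total)
    return best

xs = [3, -5, 2, 2, -1, 4, -6, 1]
# Reversing time turns the Lindley value into a running maximum, path by path.
print(lindley_final(xs), max_partial_sum(xs[::-1]))  # 2 2
```

Because i.i.d. increments have the same joint distribution when reversed, this pathwise identity yields the distributional equality between the two processes.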
|
22
|
Abstract
The polydisperse coagulation equation models irreversible aggregation of particles with varying masses. This paper uses a one-parameter family of discrete-time continuous multitype branching processes to solve the polydisperse coagulation equation when The critical time tc when diverges corresponds to a critical branching process, while post-critical times t > tc correspond to supercritical branching processes.
|
23
|
Sheetlin S, Park Y, Frith MC, Spouge JL. ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 2015; 32:304-5. [PMID: 26428291 DOI: 10.1093/bioinformatics/btv575] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/21/2015] [Accepted: 09/28/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. AVAILABILITY AND IMPLEMENTATION To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. CONTACT spouge@nih.gov SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
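For readers unfamiliar with E-values, the sketch below shows the basic Karlin-Altschul form, E = K·m·n·exp(−λS). It is not the ALP or FALP interface, and it omits the finite-size corrections that such libraries estimate. The parameter values are illustrative placeholders rather than output from ALP.

```python
import math

def karlin_altschul_evalue(score, query_len, db_len, lam, k):
    """Expected number of chance local alignments with score >= `score`.

    Basic form without finite-size (edge-effect) correction; lam and k
    depend on the scoring matrix, gap costs and letter abundances.
    """
    return k * query_len * db_len * math.exp(-lam * score)

def bit_score(score, lam, k):
    """Normalized score S' such that E = m * n * 2 ** (-S')."""
    return (lam * score - math.log(k)) / math.log(2)

# Placeholder parameters for illustration only (assumed, not from the source).
LAM, K = 0.267, 0.041
print(karlin_altschul_evalue(100, 300, 1_000_000, LAM, K))
print(bit_score(100, LAM, K))
```

Estimating `lam` and `k` for arbitrary scoring systems is the hard part, which is the task the abstract attributes to ALP and FALP.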
Affiliation(s)
- Sergey Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
- Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
- Martin C Frith
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
|
24
|
Tewari S, Spouge JL. Coalescent: an open-science framework for importance sampling in coalescent theory. PeerJ 2015; 3:e1203. [PMID: 26312189 PMCID: PMC4548476 DOI: 10.7717/peerj.1203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2014] [Accepted: 07/30/2015] [Indexed: 11/20/2022]
Abstract
Background. In coalescent theory, computer programs often use importance sampling to calculate likelihoods and other statistical quantities. An importance sampling scheme can exploit human intuition to improve statistical efficiency of computations, but unfortunately, in the absence of general computer frameworks on importance sampling, researchers often struggle to translate new sampling schemes computationally or benchmark against different schemes, in a manner that is reliable and maintainable. Moreover, most studies use computer programs lacking a convenient user interface or the flexibility to meet the current demands of open science. In particular, current computer frameworks can only evaluate the efficiency of a single importance sampling scheme or compare the efficiencies of different schemes in an ad hoc manner. Results. We have designed a general framework (http://coalescent.sourceforge.net; language: Java; License: GPLv3) for importance sampling that computes likelihoods under the standard neutral coalescent model of a single, well-mixed population of constant size over time following infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. For a given dataset, it computes the likelihood and provides the maximum likelihood estimate of the mutation parameter. Well-known benchmarks in the coalescent literature validate the accuracy of the framework. The framework provides an intuitive user interface with minimal clutter. For performance, the framework switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework reliable and maintainable. Conclusions. 
In coalescent theory, many studies of computational efficiency consider only effective sample size. Here, we evaluate proposals in the coalescent literature, to discover that the order of efficiency among the three importance sampling schemes changes when one considers running time as well as effective sample size. We also describe a computational technique called "just-in-time delegation" available to improve the trade-off between running time and precision by constructing improved importance sampling schemes from existing ones. Thus, our systems approach is a potential solution to the "2^8 programs problem" highlighted by Felsenstein, because it provides the flexibility to include or exclude various features of similar coalescent models or importance sampling schemes.
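The comparison of effective sample size with and without running time can be made concrete with a small sketch. It assumes the usual Kish definition, ESS = (Σw)² / Σw², which the abstract does not spell out. The per-second figure of merit and all names are hypothetical illustrations, not the framework's Java API.

```python
def effective_sample_size(weights):
    """Kish effective sample size of a set of importance weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def ess_per_second(weights, seconds):
    """Hypothetical figure of merit: effective samples per second of running time."""
    return effective_sample_size(weights) / seconds

# Two made-up schemes: B has the better ESS but takes far longer to run.
scheme_a = ([1.0, 0.2, 0.1, 0.1], 1.0)
scheme_b = ([1.0, 0.9, 1.1, 1.0], 10.0)
for name, (weights, seconds) in (("A", scheme_a), ("B", scheme_b)):
    print(name, effective_sample_size(weights), ess_per_second(weights, seconds))
```

In this toy case the ranking of the two schemes reverses once running time enters, which is the kind of reordering the abstract reports for the three schemes it evaluates.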
Affiliation(s)
- Susanta Tewari
- National Center for Biotechnology Information, Bethesda, MD, United States
- John L Spouge
- National Center for Biotechnology Information, Bethesda, MD, United States
|
25
|
Carroll HD, Williams AC, Davis AG, Spouge JL. Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:531-537. [PMID: 26357264 PMCID: PMC4568567 DOI: 10.1109/tcbb.2014.2366112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]
Abstract
Over the past few decades, discovery based on sequence homology has become a widely accepted practice. Consequently, comparative accuracy of retrieval algorithms (e.g., BLAST) has been rigorously studied for improvement. Unlike most components of retrieval algorithms, the E-value threshold criterion has yet to be thoroughly investigated. An investigation of the threshold is important as it exclusively dictates which sequences are declared relevant and irrelevant. In this paper, we introduce the false discovery rate (FDR) statistic as a replacement for the uniform threshold criterion in order to improve efficacy in retrieval systems. Using NCBI's BLAST and PSI-BLAST software packages, we demonstrate the applicability of such a replacement in both non-iterative (BLAST(FDR)) and iterative (PSI-BLAST(FDR)) homology searches. For each application, we performed an evaluation of retrieval efficacy with five different multiple testing methods on a large training database. For each algorithm, we choose the best performing method, Benjamini-Hochberg, as the default statistic. As measured by the threshold average precision, BLAST(FDR) yielded 14.1 percent better retrieval performance than BLAST on a large (5,161 queries) test database and PSI-BLAST(FDR) attained 11.8 percent better retrieval performance than PSI-BLAST. The C++ source code specific to BLAST(FDR) and PSI-BLAST(FDR) and instructions are available at http://www.cs.mtsu.edu/~hcarroll/blast_fdr/.
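The Benjamini-Hochberg procedure that this abstract selects as its default can be sketched in a few lines. This is the generic step-up rule applied to a list of p-values. How BLAST(FDR) derives p-values from E-values is not described in the abstract, so that step is left out, and the function name is illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha.

    Step-up rule: find the largest rank r with p_(r) <= alpha * r / m
    and reject the r hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))  # [0, 1]
```

Unlike a uniform threshold, the cutoff adapts to the whole list of hits: a p-value that fails its own rank's bound is still rejected if a larger p-value passes a later bound.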
Affiliation(s)
- Hyrum D. Carroll
- Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN, 37128
- Alex C. Williams
- Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN, 37128
- Anthony G. Davis
- Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN, 37128
- John L. Spouge
- National Center for Biotechnology Information, Bethesda, MD 20894
|
26
|
Silva JC, Egan A, Arze C, Spouge JL, Harris DG. A new method for estimating species age supports the coexistence of malaria parasites and their Mammalian hosts. Mol Biol Evol 2015; 32:1354-64. [PMID: 25589738 PMCID: PMC4408405 DOI: 10.1093/molbev/msv005] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Indexed: 12/16/2022]
Abstract
Species in the genus Plasmodium cause malaria in humans and infect a variety of mammals and other vertebrates. Currently, estimated ages for several mammalian Plasmodium parasites differ by as much as one order of magnitude, an inaccuracy that frustrates reliable estimation of evolutionary rates of disease-related traits. We developed a novel statistical approach to dating the relative age of evolutionary lineages, based on Total Least Squares regression. We validated this lineage dating approach by applying it to the genus Drosophila. Using data from the Drosophila 12 Genomes project, our approach accurately reconstructs the age of well-established Drosophila clades, including the speciation event that led to the subgenera Drosophila and Sophophora, and age of the melanogaster species subgroup. We applied this approach to hundreds of loci from seven mammalian Plasmodium species. We demonstrate the existence of a molecular clock specific to individual Plasmodium proteins, and estimate the relative age of mammalian-infecting Plasmodium. These analyses indicate that: 1) the split between the human parasite Plasmodium vivax and P. knowlesi, from Old World monkeys, occurred 6.1 times earlier than that between P. falciparum and P. reichenowi, parasites of humans and chimpanzees, respectively; and 2) mammalian Plasmodium parasites originated 22 times earlier than the split between P. falciparum and P. reichenowi. Calibrating the absolute divergence times for Plasmodium with eukaryotic substitution rates, we show that the split between P. falciparum and P. reichenowi occurred 3.0-5.5 Ma, and that mammalian Plasmodium parasites originated over 64 Ma. Our results indicate that mammalian-infecting Plasmodium evolved contemporaneously with their hosts, with little evidence for parasite host-switching on an evolutionary scale, and provide a solid timeframe within which to place the evolution of new Plasmodium species.
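Total Least Squares regression, on which the dating approach in this abstract is based, minimizes perpendicular rather than vertical distances, so it treats both variables as noisy. The sketch below fits a straight line by the closed-form orthogonal-regression solution. It is a generic illustration assuming equal error variance in x and y, not the authors' estimator, and the function name is hypothetical.

```python
from math import sqrt

def tls_line(xs, ys):
    """Fit y = slope * x + intercept by total (orthogonal) least squares.

    Assumes equal error variance in both coordinates and a non-zero
    sample covariance between xs and ys.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("covariance is zero; the TLS slope is undefined here")
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, mean_y - slope * mean_x

print(tls_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```

A useful property for comparing lineages is symmetry: swapping the roles of x and y gives the reciprocal slope, which ordinary least squares does not guarantee.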
Affiliation(s)
- Joana C Silva
- Institute for Genome Sciences, University of Maryland School of Medicine; Department of Microbiology and Immunology, University of Maryland School of Medicine
- Amy Egan
- Institute for Genome Sciences, University of Maryland School of Medicine
- Cesar Arze
- Institute for Genome Sciences, University of Maryland School of Medicine
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
- David G Harris
- Department of Applied Mathematics and Statistics, University of Maryland, College Park
|
27
|
Abstract
MOTIVATION The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. RESULTS We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Affiliation(s)
- Sergey L Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
- Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
- Martin C Frith
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
|
28
|
Spouge JL, Mariño-Ramírez L, Sheetlin SL. Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps. Int J Bioinform Res Appl 2014; 10:384-408. [PMID: 24989859 DOI: 10.1504/ijbra.2014.062991] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022]
Abstract
Some biological sequences contain subsequences of unusual composition; e.g. some proteins contain DNA binding domains, transmembrane regions and charged regions, and some DNA sequences contain repeats. The linear-time Ruzzo-Tompa (RT) algorithm finds subsequences of unusual composition, using a sequence of scores as input and the corresponding 'maximal segments' as output. In principle, permitting gaps in the output subsequences could improve sensitivity. Here, the input of the RT algorithm is generalised to a finite, totally ordered, weighted graph, so the algorithm locates paths of maximal weight through increasing but not necessarily adjacent vertices. By permitting the penalised deletion of unfavourable letters, the generalisation therefore includes gaps. The program RepWords, which finds inexact simple repeats in DNA, exemplifies the general concepts by out-performing a similar extant, ad hoc tool. With minimal programming effort, the generalised Ruzzo-Tompa algorithm could improve the performance of many programs for finding biological subsequences of unusual composition.
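As a rough illustration of the input and output described in this abstract, the sketch below implements the classic, ungapped Ruzzo-Tompa procedure: a sequence of scores goes in, and the maximal segments come out. It is not the generalised graph version of the article and not the RepWords program. The function name `maximal_segments` is hypothetical, and this simple list-scanning form does not include the bookkeeping needed for the linear-time guarantee.

```python
def maximal_segments(scores):
    """Return all maximal scoring segments as (start, end, score), end exclusive.

    Classic (ungapped) Ruzzo-Tompa procedure, written for clarity:
    the right-to-left scan below is not the linear-time variant.
    """
    candidates = []  # each entry: [start, end, L, R]
    total = 0        # cumulative score of everything read so far
    for i, s in enumerate(scores):
        before = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a candidate
        # L = cumulative total before the segment, R = total at its end
        cur = [i, i + 1, before, total]
        while True:
            # Find the rightmost candidate j with L_j < L_cur.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= cur[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= cur[3]:
                candidates.append(cur)  # cur stands on its own
                break
            # Otherwise extend cur leftwards over candidates j..end and retry.
            cur = [candidates[j][0], cur[1], candidates[j][2], cur[3]]
            del candidates[j:]
    return [(a, b, r - l) for a, b, l, r in candidates]


if __name__ == "__main__":
    print(maximal_segments([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```

Each reported segment is a half-open index range into the score list, so `scores[start:end]` recovers the letters of the segment.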
Affiliation(s)
- John L Spouge
  - Computational Biology Branch, National Center for Biotechnology Information, Bethesda, MD 20894, USA
- Leonardo Mariño-Ramírez
  - Computational Biology Branch, National Center for Biotechnology Information, Bethesda, MD 20894, USA
- Sergey L Sheetlin
  - Computational Biology Branch, National Center for Biotechnology Information, Bethesda, MD 20894, USA
29
Mandoiu I, Pop M, Rajasekaran S, Spouge JL. This special issue includes a selection of papers presented at the 2nd IEEE International Conference. Introduction. Int J Bioinform Res Appl 2014; 10:341-344. [PMID: 25715438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
30
Spouge JL. Within a sample from a population, the distribution of the number of descendants of a subsample's most recent common ancestor. Theor Popul Biol 2013; 92:51-4. [PMID: 24321308 DOI: 10.1016/j.tpb.2013.11.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/26/2013] [Revised: 11/23/2013] [Accepted: 11/26/2013] [Indexed: 10/25/2022]
Abstract
Sample n individuals uniformly at random from a population, and then sample m individuals uniformly at random from the sample. Consider the most recent common ancestor (MRCA) of the subsample of m individuals. Let the subsample MRCA have j descendants in the sample (m ⩽ j ⩽ n). Under a Moran or coalescent model (and therefore under many other models), the probability that j = n is known. In this case, the subsample MRCA is an ancestor of every sampled individual, and the subsample and sample MRCAs are identical. The probability that j = m is also known. In this case, the subsample MRCA is an ancestor of no sampled individual outside the subsample. This article derives the complete distribution of j, enabling inferences from the corresponding p-value. The text presents hypothetical statistical applications pertinent to taxonomy (the gene flow between Neanderthals and anatomically modern humans) and medicine (the association of genetic markers with disease).
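For the case j = n that the abstract describes as known, the standard coalescent gives the closed form P(j = n) = (m - 1)(n + 1) / ((m + 1)(n - 1)). The sketch below evaluates it exactly. The formula is quoted from the general coalescent literature rather than from this article, the function name `prob_same_mrca` is hypothetical, and the complete distribution of j derived in the article is not reproduced here.

```python
from fractions import Fraction


def prob_same_mrca(m, n):
    """Probability that a subsample of size m shares its MRCA with the
    whole sample of size n (the case j = n), under the standard coalescent."""
    if not 2 <= m <= n:
        raise ValueError("need 2 <= m <= n")
    return Fraction((m - 1) * (n + 1), (m + 1) * (n - 1))


if __name__ == "__main__":
    # Two individuals drawn from a sample of three: the MRCAs differ only
    # when the chosen pair is the first pair to coalesce (probability 1/3).
    print(prob_same_mrca(2, 3))
    # For a pair drawn from a very large sample the value approaches 1/3.
    print(float(prob_same_mrca(2, 10**6)))
```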
Affiliation(s)
- John L Spouge
  - Building 38A, Room 6N603, National Center for Biotechnology Information, Bethesda MD 20894, United States.
31
Gonzalez MW, Spouge JL. Domain analysis of symbionts and hosts (DASH) in a genome-wide survey of pathogenic human viruses. BMC Res Notes 2013; 6:209. [PMID: 23706066 PMCID: PMC3672079 DOI: 10.1186/1756-0500-6-209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2012] [Accepted: 05/17/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet, pairwise-sequence searches have limited sensitivity resulting in poor identification of divergent homologies. RESULTS Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. The order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively). 
CONCLUSIONS We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species thereby mining functional information to identify shared as well as exclusive domains to each organism. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at http://tinyurl.com/spouge-dash.
Affiliation(s)
- Mileidy W Gonzalez
  - National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, 8600 Rockville Pike, Building 38A, Room 6N611-M, Bethesda, MD 20894, USA.
32
Suwannasai N, Martín MP, Phosri C, Sihanonth P, Whalley AJS, Spouge JL. Fungi in Thailand: a case study of the efficacy of an ITS barcode for automatically identifying species within the Annulohypoxylon and Hypoxylon genera. PLoS One 2013; 8:e54529. [PMID: 23390499 PMCID: PMC3563529 DOI: 10.1371/journal.pone.0054529] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 04/04/2012] [Accepted: 12/13/2012] [Indexed: 11/20/2022] Open
Abstract
Thailand, a part of the Indo-Burma biodiversity hotspot, has many endemic animals and plants. Some of its fungal species are difficult to recognize and separate, complicating assessments of biodiversity. We assessed species diversity within the fungal genera Annulohypoxylon and Hypoxylon, which produce biologically active and potentially therapeutic compounds, by applying classical taxonomic methods to 552 teleomorphs collected from across Thailand. Using probability of correct identification (PCI), we also assessed the efficacy of automated species identification with a fungal barcode marker, ITS, in the model system of Annulohypoxylon and Hypoxylon. The 552 teleomorphs yielded 137 ITS sequences; in addition, we examined 128 GenBank ITS sequences, to assess biases in evaluating a DNA barcode with GenBank data. The use of multiple sequence alignment in a barcode database like BOLD raises some concerns about non-protein barcode markers like ITS, so we also compared species identification using different alignment methods. Our results suggest the following. (1) Multiple sequence alignment of ITS sequences is competitive with pairwise alignment when identifying species, so BOLD should be able to preserve its present bioinformatics workflow for species identification for ITS, and possibly therefore with at least some other non-protein barcode markers. (2) Automated species identification is insensitive to a specific choice of evolutionary distance, contributing to resolution of a current debate in DNA barcoding. (3) Statistical methods are available to address, at least partially, the possibility of expert misidentification of species. Phylogenetic trees discovered a cryptic species and strongly supported monophyletic clades for many Annulohypoxylon and Hypoxylon species, suggesting that ITS can contribute usefully to a barcode for these fungi. 
The PCIs here, derived solely from ITS, suggest that a fungal barcode will require secondary markers in Annulohypoxylon and Hypoxylon, however. The URL http://tinyurl.com/spouge-barcode contains computer programs and other supplementary material relevant to this article.
Affiliation(s)
- Nuttika Suwannasai
  - Department of Biology, Faculty of Science, Srinakharinwirot University, Bangkok, Thailand
- María P. Martín
  - Department of Mycology, Real Jardín Botánico-CSIC, Plaza de Murillo 2, Madrid, Spain
- Cherdchai Phosri
  - Microbiology Programme, Faculty of Science and Technology, Pibulsongkram Rajabhat University, Phitsanulok, Thailand
- Prakitsin Sihanonth
  - Department of Microbiology, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
- Anthony J. S. Whalley
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, United Kingdom
- John L. Spouge
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, United States of America
33
Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, Fiore-Donno AM, Gile GH, Holzmann M, Jahn R, Jirků M, Keeling PJ, Kostka M, Kudryavtsev A, Lara E, Lukeš J, Mann DG, Mitchell EAD, Nitsche F, Romeralo M, Saunders GW, Simpson AGB, Smirnov AV, Spouge JL, Stern RF, Stoeck T, Zimmermann J, Schindel D, de Vargas C. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 2012; 10:e1001419. [PMID: 23139639 PMCID: PMC3491025 DOI: 10.1371/journal.pbio.1001419] [Citation(s) in RCA: 319] [Impact Index Per Article: 26.6] [Indexed: 11/18/2022] Open
Abstract
A group of protist experts proposes a two-step DNA barcoding approach, comprising a universal eukaryotic pre-barcode followed by group-specific barcodes, to unveil the hidden biodiversity of microbial eukaryotes.
Affiliation(s)
- Jan Pawlowski
  - Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
  - * E-mail: (JP); (CdV)
- Stéphane Audic
  - Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7144 and Université Pierre et Marie Curie, Paris 6, Station Biologique de Roscoff, France
- Sina Adl
  - Department of Soil Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- David Bass
  - Department of Life Sciences, Natural History Museum, London, United Kingdom
- Lassaâd Belbahri
  - Laboratory of Soil Biology, University of Neuchâtel, Neuchâtel, Switzerland
- Cédric Berney
  - Department of Life Sciences, Natural History Museum, London, United Kingdom
- Samuel S. Bowser
  - Wadsworth Center, New York State Department of Health, Albany, New York, United States of America
- Ivan Cepicka
  - Department of Zoology, Charles University in Prague, Prague, Czech Republic
- Johan Decelle
  - Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7144 and Université Pierre et Marie Curie, Paris 6, Station Biologique de Roscoff, France
- Micah Dunthorn
  - Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
- Anna Maria Fiore-Donno
  - Institute of Botany and Landscape Ecology, University of Greifswald, Greifswald, Germany
- Gillian H. Gile
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
- Maria Holzmann
  - Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
- Regine Jahn
  - Botanischer Garten und Botanischer Museum Berlin-Dahlem, Freie Universität Berlin, Berlin, Germany
- Miloslav Jirků
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
- Patrick J. Keeling
  - Canadian Institute for Advanced Research, Botany Department, University of British Columbia, Vancouver, British Columbia, Canada
- Martin Kostka
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- Alexander Kudryavtsev
  - Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
  - Department of Invertebrate Zoology, St-Petersburg State University, St-Petersburg, Russia
- Enrique Lara
  - Laboratory of Soil Biology, University of Neuchâtel, Neuchâtel, Switzerland
- Julius Lukeš
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- David G. Mann
  - Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
- Frank Nitsche
  - Allgemeine Ökologie, Universität zu Köln, Köln, Germany
- Maria Romeralo
  - Department of Systematic Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
- Gary W. Saunders
  - Department of Biology, University of New Brunswick, Fredericton, New Brunswick, Canada
- Alexey V. Smirnov
  - Department of Invertebrate Zoology, St-Petersburg State University, St-Petersburg, Russia
- John L. Spouge
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Computational Biology Branch, Bethesda, Maryland, United States of America
- Rowena F. Stern
  - Sir Alister Hardy Foundation for Ocean Science, Citadel Hill, Plymouth, United Kingdom
- Thorsten Stoeck
  - Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
- Jonas Zimmermann
  - Botanischer Garten und Botanischer Museum Berlin-Dahlem, Freie Universität Berlin, Berlin, Germany
  - Justus-Liebig-University, Giessen, Germany
- David Schindel
  - Smithsonian Institution, National Museum of Natural History, Washington, DC, United States of America
- Colomban de Vargas
  - Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7144 and Université Pierre et Marie Curie, Paris 6, Station Biologique de Roscoff, France
  - * E-mail: (JP); (CdV)
34
Tewari S, Spouge JL. Coalescent: an open-source and scalable framework for exact calculations in coalescent theory. BMC Bioinformatics 2012; 13:257. [PMID: 23033878 PMCID: PMC3575375 DOI: 10.1186/1471-2105-13-257] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/09/2011] [Accepted: 10/02/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Currently, there is no open-source, cross-platform and scalable framework for coalescent analysis in population genetics. There is no scalable GUI-based user application either. Such a framework and application would not only drive the creation of more complex and realistic models but also make them truly accessible. RESULTS As a first attempt, we built a framework and user application for the domain of exact calculations in coalescent analysis. The framework provides an API with the concepts of model, data, statistic, phylogeny, gene tree and recursion. Infinite-alleles and infinite-sites models are considered. It defines pluggable computations such as counting and listing all the ancestral configurations and genealogies and computing the exact probability of data. It can visualize a gene tree, trace and visualize the internals of the recursion algorithm for further improvement, and dynamically attach a number of output processors. The user application defines jobs in a plug-in-like manner so that they can be activated, deactivated, installed or uninstalled on demand. Multiple jobs can be run and their inputs edited. Job inputs are persisted across restarts and running jobs can be cancelled where applicable. CONCLUSIONS Coalescent theory plays an increasingly important role in analysing molecular population genetic data. Models involved are mathematically difficult and computationally challenging. An open-source, scalable framework that lets users immediately take advantage of the progress made by others will enable exploration of yet more difficult and realistic models. As models become more complex and mathematically less tractable, the need for an integrated computational approach is obvious. Object-oriented designs, though they have upfront costs, are practical now and can provide such an integrated approach.
Affiliation(s)
- Susanta Tewari
  - National Center for Biotechnology Information, Bethesda, MD 20894, USA.
35
Park Y, Sheetlin S, Ma N, Madden TL, Spouge JL. New finite-size correction for local alignment score distributions. BMC Res Notes 2012; 5:286. [PMID: 22691307 PMCID: PMC3483159 DOI: 10.1186/1756-0500-5-286] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/30/2012] [Accepted: 05/16/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a "finite-size" correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score. FINDINGS We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences. CONCLUSIONS The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site (http://blast.ncbi.nlm.nih.gov).
Affiliation(s)
- Yonil Park
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
36
Abstract
This chapter describes a workflow for measuring the efficacy of a barcode in identifying species. First, assemble individual sequence databases corresponding to each barcode marker. A controlled collection of taxonomic data is preferable to GenBank data, because GenBank data can be problematic, particularly when comparing barcodes based on more than one marker. To ensure proper controls when evaluating species identification, specimens not having a sequence in every marker database should be discarded. Second, select a computer algorithm for assigning species to barcode sequences. No algorithm has yet improved notably on assigning a specimen to the species of its nearest neighbor within a barcode database. Because global sequence alignments (e.g., with the Needleman-Wunsch algorithm, or some related algorithm) examine entire barcode sequences, they generally produce better species assignments than local sequence alignments (e.g., with BLAST). No neighboring method (e.g., global sequence similarity, global sequence distance, or evolutionary distance based on a global alignment) has yet shown a notable superiority in identifying species. Finally, "the probability of correct identification" (PCI) provides an appropriate measurement of barcode efficacy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. This chapter states explicitly how to calculate PCI, how to estimate its statistical sampling error, and how to use data on PCR failure to set limits on how much improvements in PCR technology can improve species identification.
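The workflow above (nearest-neighbor assignment, then an average of per-species PCIs) can be sketched roughly as follows. Several choices here are assumptions made for illustration and are not taken from the chapter: sequences are taken as already globally aligned to equal length, distance is the simple fraction of mismatches, each specimen is queried against all the others (leave-one-out), and a query counts as correct only when every nearest neighbor belongs to its own species. The names `p_distance` and `overall_pci` are hypothetical.

```python
from collections import defaultdict


def p_distance(a, b):
    """Fraction of mismatching columns between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def overall_pci(specimens):
    """specimens: list of (species, aligned_sequence) pairs.

    Returns (overall PCI, {species: species PCI}), where the overall PCI
    is the unweighted average of the species PCIs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for i, (species, seq) in enumerate(specimens):
        # Distances from the query to every other specimen in the database.
        others = [(p_distance(seq, s), sp)
                  for j, (sp, s) in enumerate(specimens) if j != i]
        best = min(d for d, _ in others)
        nearest_species = {sp for d, sp in others if d == best}
        total[species] += 1
        # Assumed tie rule: ambiguous nearest neighbors count as a failure.
        if nearest_species == {species}:
            correct[species] += 1
    per_species = {sp: correct[sp] / total[sp] for sp in total}
    return sum(per_species.values()) / len(per_species), per_species


if __name__ == "__main__":
    toy = [("A", "ACGTACGT"), ("A", "ACGTACGA"),
           ("B", "TTGTACCC"), ("B", "TTGTACCA"),
           ("C", "GGGGACGT")]
    print(overall_pci(toy))
```

In this toy data set, species C has a single specimen, so a leave-one-out query can never find a conspecific neighbor and its species PCI is zero, which pulls the overall average down.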
Affiliation(s)
- John L Spouge
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
37
Sheetlin S, Park Y, Spouge JL. Objective method for estimating asymptotic parameters, with an application to sequence alignment. Phys Rev E Stat Nonlin Soft Matter Phys 2011; 84:031914. [PMID: 22060410 PMCID: PMC3233989 DOI: 10.1103/physreve.84.031914] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/02/2011] [Revised: 06/14/2011] [Indexed: 05/31/2023]
Abstract
Sequence alignment is an indispensable computational tool in modern molecular biology. The model underlying biological sequence alignment is of interest to physicists because it approximates the statistical mechanics of DNA and protein annealing, while bearing an intimate relationship to models of directed polymers in random media. Recent methods for determining the statistics of random sequence alignments have reduced the computation time to less than 1 s, opening up some interesting possibilities for online computation with biological search engines. Before implementation, however, the methods required an objective technique for computing regression coefficients pertinent to an asymptotic regime. Typically, physicists estimate parameters pertinent to an asymptotic regime subjectively: They eyeball their data; estimate the asymptotic regime where the regression model holds with reasonable accuracy; and then regress data only within the estimated asymptotic regime. Our publicly available computer program ARRP replaces the subjective assessment of the asymptotic regime with an objective change-point detection method, increasing confidence in the scientific objectivity of the parameter estimates. Asymptotic regression has potential applications across most of physics.
Affiliation(s)
- Sergey Sheetlin
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
- Yonil Park
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
- John L. Spouge
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA
38
Abstract
MOTIVATION Since database retrieval is a fundamental operation, the measurement of retrieval efficacy is critical to progress in bioinformatics. This article points out some issues with current methods of measuring retrieval efficacy and suggests some improvements. In particular, many studies have used the pooled receiver operating characteristic for n irrelevant records (ROC(n)) score, the area under the ROC curve (AUC) of a 'pooled' ROC curve, truncated at n irrelevant records. Unfortunately, the pooled ROC(n) score does not faithfully reflect actual usage of retrieval algorithms. Additionally, a pooled ROC(n) score can be very sensitive to retrieval results from as little as a single query. METHODS To replace the pooled ROC(n) score, we propose the Threshold Average Precision (TAP-k), a measure closely related to the well-known average precision in information retrieval, but reflecting the usage of E-values in bioinformatics. Furthermore, in addition to conditions previously given in the literature, we introduce three new criteria that an ideal measure of retrieval efficacy should satisfy. RESULTS PSI-BLAST, GLOBAL, HMMER and RPS-BLAST provided examples of using the TAP-k and pooled ROC(n) scores to evaluate sequence retrieval algorithms. In particular, compelling examples using real data highlight the drawbacks of the pooled ROC(n) score, showing that it can produce evaluations skewing far from intuitive expectations. In contrast, the TAP-k satisfies most of the criteria desired in an ideal measure of retrieval efficacy. AVAILABILITY AND IMPLEMENTATION The TAP-k web server and downloadable Perl script are freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/tap/
Affiliation(s)
- Hyrum D Carroll
  - National Center for Biotechnology Information, Bethesda, MD 20894, USA
39
Park Y, Sheetlin S, Spouge JL. Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann Stat 2009; 37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
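To show how the Gumbel parameters are used once they have been estimated, the sketch below applies the standard Karlin-Altschul form, E = K m n exp(-λS) and P = 1 - exp(-E), where λ is the scale parameter discussed in the abstract. It does not implement the importance-sampling estimator of the article and applies no finite-size correction. The function names are hypothetical, and the values λ = 0.267 and K = 0.041 are illustrative inputs only.

```python
import math


def evalue(score, m, n, lam, K):
    """Expected number of chance local alignments with score >= `score`
    between random sequences of lengths m and n."""
    return K * m * n * math.exp(-lam * score)


def pvalue(score, m, n, lam, K):
    """Gumbel tail probability P(S >= score) = 1 - exp(-E).

    expm1 keeps precision when E is tiny, where P is close to E.
    """
    return -math.expm1(-evalue(score, m, n, lam, K))


if __name__ == "__main__":
    lam, K = 0.267, 0.041  # illustrative parameter values
    for s in (40, 60, 80, 100):
        e = evalue(s, 300, 1_000_000, lam, K)
        p = pvalue(s, 300, 1_000_000, lam, K)
        print(f"score={s:3d}  E={e:.3e}  P={p:.3e}")
```

Because both quantities depend on the scoring scheme only through λ and K, a fast estimate of those two parameters is what would allow arbitrary scoring schemes to be used in an online search.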
Affiliation(s)
- Yonil Park
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, Maryland 20894, USA
40
Abstract
Reliable detection of cis-regulatory elements in promoter regions is a difficult and unsolved problem in computational biology. The intricacy of transcriptional regulation in higher eukaryotes, primarily in metazoans, could be a major driving force of organismal complexity. Eukaryotic genome annotations have improved greatly due to large-scale characterization of full-length cDNAs, transcriptional start sites (TSSs), and comparative genomics. Regulatory elements are identified in promoter regions using a variety of enumerative or alignment-based methods. Here we present a survey of recent computational methods for eukaryotic promoter analysis and describe the use of an alignment-based method implemented in the A-GLAM program.
Affiliation(s)
- Leonardo Mariño-Ramírez
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
41
Abstract
Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, e.g. spurious flanks probably comprise >18% of the human-fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple 'overalignment' P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.
Affiliation(s)
- Martin C Frith
  - Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
42
Kim NK, Tharakaraman K, Mariño-Ramírez L, Spouge JL. Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 2008; 9:262. [PMID: 18533028 PMCID: PMC2432075 DOI: 10.1186/1471-2105-9-262] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 10/29/2007] [Accepted: 06/04/2008] [Indexed: 12/03/2022] Open
Abstract
Background Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches. Results A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly. Conclusion Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.
Affiliation(s)
- Nak-Kyeong Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
|
43
|
Tharakaraman K, Bodenreider O, Landsman D, Spouge JL, Mariño-Ramírez L. The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Res 2008; 36:2777-86. [PMID: 18367472 PMCID: PMC2377430 DOI: 10.1093/nar/gkn137] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 12/11/2022]
Abstract
A number of previous studies have predicted transcription factor binding sites (TFBSs) by exploiting the position of genomic landmarks like the transcriptional start site (TSS). The studies’ methods are generally too computationally intensive for genome-scale investigation, so the full potential of ‘positional regulomics’ to discover TFBSs and determine their function remains unknown. Because databases often annotate the genomic landmarks in DNA sequences, the methodical exploitation of positional regulomics has become increasingly urgent. Accordingly, we examined a set of 7914 human putative promoter regions (PPRs) with a known TSS. Our methods identified 1226 eight-letter DNA words with significant positional preferences with respect to the TSS, of which only 608 matched known TFBSs. However, many groups of genes whose PPRs contained a common word displayed similar expression profiles and related biological functions. Most interestingly, our results included 78 words, each of which clustered significantly in two or three different positions relative to the TSS. Often, the gene groups corresponding to different positional clusters of the same word corresponded to diverse functions, e.g. activation or repression in different tissues. Thus, different clusters of the same word likely reflect the phenomenon of ‘positional regulation’, i.e. a word's regulatory function can vary with its position relative to a genomic landmark, a conclusion inaccessible to methods based purely on sequence. Further integrative analysis of words co-occurring in PPRs also yielded 24 different groups of genes, likely identifying cis-regulatory modules de novo. Whereas comparative genomics requires precise sequence alignments, positional regulomics exploits genomic landmarks to provide a ‘poor man's alignment’. By exploiting the phenomenon of positional regulation, it uses position to differentiate the biological functions of subsets of TFBSs sharing a common sequence motif.
Affiliation(s)
- Kannan Tharakaraman
- Computational Biology Branch, National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
|
44
|
Kann MG, Sheetlin SL, Park Y, Bryant SH, Spouge JL. The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 2007; 35:4678-85. [PMID: 17596268 PMCID: PMC1950549 DOI: 10.1093/nar/gkm414] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
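The notion of semi-global alignment described above (a complete domain aligned to a subsequence of the protein) can be illustrated with a small dynamic-programming sketch. This is not the GLOBAL tool or an HMM: the function name, the match/mismatch scores, and the linear gap penalty are assumptions chosen only to show that the whole domain must be consumed while unaligned protein ends cost nothing.

```python
def semi_global_score(domain, protein, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `domain` to some substring of `protein`.

    Hypothetical toy scoring: identity match/mismatch and a linear gap cost.
    """
    m = len(protein)
    # Row 0 is all zeros: skipping a prefix of the protein is free.
    prev = [0] * (m + 1)
    for i in range(1, len(domain) + 1):
        # Column 0: domain residues aligned to nothing must pay the gap cost.
        curr = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align the two residues
                          prev[j] + gap,       # gap in the protein
                          curr[j - 1] + gap)   # gap in the domain
        prev = curr
    # Taking the maximum over the last row makes the protein suffix free.
    return max(prev)


print(semi_global_score("ACD", "XXACDXX"))
```

A local alignment would instead allow the score to restart at zero in every cell, which is what lets it report only a piece of the domain.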
Affiliation(s)
- John L. Spouge
- *To whom correspondence should be addressed. Tel: 301 402 9310; Fax: 301 480 2484
|
45
|
Abstract
Computer analysis of biological sequences often detects deviations from a random model. In the usual model, sequence letters are chosen independently, according to some fixed distribution over the relevant alphabet. Real biological sequences often contain simple repeats, however, which can be broadly characterized as multiple contiguous copies (usually inexact) of a specific word. This paper quantifies inexact simple repeats as local sums in a Markov additive process (MAP). The maximum of the local sums has an asymptotic distribution with two parameters (λ and k), which are given by general MAP formulas. The general MAP formulas are usually computationally intractable, but an essential simplification in the case of repeats permits λ and k to be computed from matrices whose dimension equals the size of the relevant alphabet. The simplification applies to some MAPs where the summand distributions do not depend on consecutive pairs of Markov states as usual, but on pairs with a fixed time-lag larger than one.
|
46
|
Abstract
MOTIVATION Many computational methods for identifying regulatory elements use a likelihood ratio between motif and background models. Often, the methods use a background model of independent bases. At least two different Markov background models have been proposed with the aim of increasing the accuracy of predicting regulatory elements. Both Markov background models suffer theoretical drawbacks, so this article develops a third, context-dependent Markov background model from fundamental statistical principles. RESULTS Datasets containing known regulatory elements in eukaryotes provided a basis for comparing the predictive accuracies of the different background models. Non-parametric statistical tests indicated that Markov models of order 3 constituted a statistically significant improvement over the background model of independent bases. Our model performed slightly better than the previous Markov background models. We also found that for discriminating between the predictive accuracies of competing background models, the correlation coefficient is a more sensitive measure than the performance coefficient. AVAILABILITY Our C++ program is available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/2006-07-19
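As a rough illustration of what a Markov background model of order 3 means, the sketch below estimates a plain fixed-order Markov model from training sequences and scores a sequence under it. It is not the context-dependent model developed in the article; the function names, the pseudocount, and the handling of the first `order` bases (which are simply skipped) are assumptions made for brevity.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=3, alphabet="ACGT", pseudo=1.0):
    """Return prob(context, base) for a fixed-order Markov background."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        # Unseen contexts fall back to the uniform pseudocount distribution.
        c = counts.get(context, {})
        total = sum(c.values()) + pseudo * len(alphabet)
        return (c.get(base, 0.0) + pseudo) / total

    return prob


def log_likelihood(seq, prob, order=3):
    """Log-probability of seq[order:] given the preceding bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


background = train_markov(["ACGTACGTACGT"], order=3)
print(log_likelihood("ACGT", background))
```

A likelihood-ratio motif score would subtract this background log-likelihood from the log-likelihood of the same bases under the motif model.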
Affiliation(s)
- Nak-Kyeong Kim
- National Center for Biotechnology Information, National Library of Medicine National Institutes of Health, Bethesda, MD 20894, USA
|
47
|
Tharakaraman K, Mariño-Ramírez L, Sheetlin SL, Landsman D, Spouge JL. Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements. BMC Bioinformatics 2006; 7:408. [PMID: 16961919 PMCID: PMC1599759 DOI: 10.1186/1471-2105-7-408] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/14/2006] [Accepted: 09/08/2006] [Indexed: 12/05/2022]
Abstract
Background Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set. Results We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence. Conclusion Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances.
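The iterate-to-convergence structure of the scanning step can be sketched as follows. This is a simplified stand-in, not A-GLAM's implementation: a log-odds score cutoff (`min_score`) replaces the E-value threshold, re-estimating the PSSM with pseudocounts replaces the Bayesian update, and all function and parameter names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudo=1.0, background=0.25):
    """Log-odds PSSM from equal-length sites, with pseudocounts."""
    pssm = []
    for col in range(len(sites[0])):
        counts = {b: pseudo for b in ALPHABET}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log(counts[b] / total / background)
                     for b in ALPHABET})
    return pssm


def score(pssm, window):
    """Individual score of one subsequence against the PSSM."""
    return sum(col[base] for col, base in zip(pssm, window))


def scan_to_convergence(seqs, seed, width, min_score, max_rounds=50):
    """Grow the set of (sequence index, start) sites until no new site qualifies."""
    chosen = set(seed)
    for _ in range(max_rounds):
        pssm = build_pssm([seqs[i][p:p + width] for i, p in sorted(chosen)])
        hits = {(i, p)
                for i, s in enumerate(seqs)
                for p in range(len(s) - width + 1)
                if score(pssm, s[p:p + width]) >= min_score}
        if hits <= chosen:      # converged: nothing new contributes
            break
        chosen |= hits
    return sorted(chosen)


seqs = ["TTTTACGTACTTTTACGTAC", "GGGGACGTACGGGG"]
print(scan_to_convergence(seqs, seed=[(0, 4), (1, 4)], width=6, min_score=4.0))
```

Starting from one seed site per sequence, as Gibbs sampling would supply, the scan picks up the second instance in the first sequence and then stops once a round adds nothing new.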
Affiliation(s)
- Kannan Tharakaraman
- Computational Biology Branch, National Center for Biotechnology Information National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
- Leonardo Mariño-Ramírez
- Computational Biology Branch, National Center for Biotechnology Information National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
- Sergey L Sheetlin
- Computational Biology Branch, National Center for Biotechnology Information National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
- David Landsman
- Computational Biology Branch, National Center for Biotechnology Information National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
- John L Spouge
- Computational Biology Branch, National Center for Biotechnology Information National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
|
48
|
Tharakaraman K, Mariño-Ramírez L, Sheetlin S, Landsman D, Spouge JL. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics 2006; 21 Suppl 1:i440-8. [PMID: 15961489 PMCID: PMC1317086 DOI: 10.1093/bioinformatics/bti1028] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
MOTIVATION The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter's TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new 'anchors' to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. However, since the positional anchors focus A-GLAM on the motif of interest, masking DNA repeats during Gibbs sampling becomes unnecessary. RESULTS In a set of human DNA sequences with experimentally characterized TSSs, the placement of 791 octonucleotide words was unusually consistent (multiple test corrected P < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE. AVAILABILITY The A-GLAM program and a list of statistically significant words are available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/.
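The first step above, a gapless alignment of promoters anchored on the TSS, amounts to recording where each word falls relative to the TSS. The sketch below tabulates those offsets for every word of length `k`. It assumes, hypothetically, that each input sequence ends exactly at its TSS, and it omits the word-specific mask and the clustering test entirely.

```python
from collections import Counter, defaultdict


def word_offsets(promoters, k=8):
    """Map each k-letter word to a Counter of its start offsets from the TSS.

    Assumes each promoter sequence ends at the TSS, so offsets are negative.
    """
    table = defaultdict(Counter)
    for seq in promoters:
        for start in range(len(seq) - k + 1):
            table[seq[start:start + k]][start - len(seq)] += 1
    return table


promoters = ["GGTATAAAAGGC", "CCCCTATAAAAGGC"]
offsets = word_offsets(promoters)
print(offsets["TATAAAAG"])
```

A word whose counts pile up at a few offsets across many promoters is a candidate for the kind of positional consistency that the clustering test evaluates.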
Affiliation(s)
- Kannan Tharakaraman
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health Building 38A, 8600 Rockville Pike, Bethesda, MD 20894-6075, USA
|
49
|
Abstract
The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
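To show how the two Gumbel parameters enter a significance calculation, the sketch below uses the standard Karlin-Altschul form, E = k·m·n·exp(−λs) with P = 1 − exp(−E), for sequences of lengths m and n. The function names are hypothetical, finite-length edge corrections are ignored, and the numerical values of λ and k in the usage line are illustrative placeholders rather than estimates taken from this paper.

```python
import math


def gumbel_evalue(score, lam, k, m, n):
    """Expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


def gumbel_pvalue(score, lam, k, m, n):
    """Probability of at least one chance alignment scoring >= `score`."""
    # -expm1(-E) equals 1 - exp(-E) and stays accurate when E is tiny.
    return -math.expm1(-gumbel_evalue(score, lam, k, m, n))


# Illustrative parameter values only (assumed, not from the abstract).
print(gumbel_evalue(50, lam=0.267, k=0.041, m=140, n=140))
print(gumbel_pvalue(50, lam=0.267, k=0.041, m=140, n=140))
```

Because the score appears in the exponent, a relative error in λ affects E far more than the same relative error in k, which is consistent with the much tighter tolerance quoted for λ (0.8%) than for k (10%).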
Affiliation(s)
- John L. Spouge
- To whom correspondence should be addressed. Tel: +301 402 9310; Fax: +301 480 2288;
|
50
|
Park Y, Sheetlin S, Spouge JL. Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment. J Phys A Math Gen 2004. [DOI: 10.1088/0305-4470/38/1/006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
|