Reference Citation Analysis

Total Articles: 99 (from Reference Citation Analysis)

Article PDFs: 25

Cited by > 0: 79

Searched Name: John L Spouge

Lam V, Sharma S, Gupta S, Spouge JL, Jordan IK, Mariño-Ramírez L. Ancestry-attenuated effects of socioeconomic deprivation on type 2 diabetes disparities in the All of Us cohort. BMC Glob Public Health 2023;1:22. [PMID: 38045036 PMCID: PMC10693462 DOI: 10.1186/s44263-023-00025-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 09/28/2023] [Indexed: 12/05/2023]

Abstract

Background

Diabetes is a common disease with a major burden on morbidity, mortality, and productivity. Type 2 diabetes (T2D) accounts for roughly 90% of all diabetes cases in the USA and has a greater observed prevalence among those who identify as Black or Hispanic.

Methods

This study aimed to assess T2D racial and ethnic disparities using the All of Us Research Program data and to measure associations between genetic ancestry (GA), socioeconomic deprivation, and T2D. We used the All of Us Researcher Workbench to analyze T2D prevalence and model its associations with GA, individual-level (iSDI), and zip code-based (zSDI) socioeconomic deprivation indices among participant self-identified race and ethnicity (SIRE) groups.

Results

The study cohort consisted of 86,488 participants from the four largest SIRE groups in All of Us: Asian (n = 2,311), Black (n = 16,282), Hispanic (n = 16,966), and White (n = 50,292). SIRE groups show characteristic genetic ancestry patterns, consistent with their diverse origins, together with a continuum of ancestry fractions within and between groups. The Black and Hispanic groups show the highest levels of socioeconomic deprivation, followed by the Asian and White groups. Black participants show the highest age- and sex-adjusted T2D prevalence (21.9%), followed by the Hispanic (19.9%), Asian (15.1%), and White (14.8%) groups. Minority SIRE groups and socioeconomic deprivation, both iSDI and zSDI, are positively associated with T2D when the entire cohort is analyzed together. However, SIRE and GA both show negative interaction effects with iSDI and zSDI on T2D. Higher levels of iSDI and zSDI are negatively associated with T2D in the Black and Hispanic groups, and higher levels of iSDI and zSDI are negatively associated with T2D at high levels of African and Native American ancestry.

Conclusions

Socioeconomic deprivation is associated with a higher prevalence of T2D in Black and Hispanic minority groups, compared to the majority White group. Nonetheless, socioeconomic deprivation is associated with reduced T2D risk within the Black and Hispanic groups. These results are paradoxical and have not been reported elsewhere, with possible explanations related to the nature of the All of Us data along with SIRE group differences in access to healthcare, diet, and lifestyle.

Lam V, Sharma S, Gupta S, Spouge JL, Jordan IK, Mariño-Ramírez L. Ancestry-attenuated effects of socioeconomic deprivation on type 2 diabetes disparities in the All of Us cohort. Res Sq 2023:rs.3.rs-2976764. [PMID: 37790565 PMCID: PMC10543018 DOI: 10.21203/rs.3.rs-2976764/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]

Abstract

Background

Diabetes is a common disease with a major burden on morbidity, mortality, and productivity. Type 2 diabetes (T2D) accounts for roughly 90% of all diabetes cases in the United States and has greater observed prevalence among those who identify as Black or Hispanic.

Methods

The aims of this study were to determine whether T2D racial and ethnic disparities can be observed in data from the All of Us Research Program and to measure associations of genetic ancestry (GA) and socioeconomic deprivation with T2D. The All of Us Researcher Workbench was used to calculate T2D prevalence and to model T2D associations with GA, individual-level (iSDI) and zip code-based (zSDI) socioeconomic deprivation indices within and between participant self-identified race and ethnicity (SIRE) groups.

Results

The study cohort consisted of 86,488 participants from the four largest SIRE groups in All of Us: Asian (n=2,311), Black (n=16,282), Hispanic (n=16,966), and White (n=50,292). SIRE groups show characteristic genetic ancestry patterns, consistent with their diverse origins, together with a continuum of ancestry fractions within and between groups. The Black and Hispanic groups show the highest median SDI values, followed by the Asian and White groups. Black participants show the highest age- and sex-adjusted T2D prevalence (21.9%), followed by the Hispanic (19.9%), Asian (15.1%), and White (14.8%) groups. Minority SIRE groups and socioeconomic deprivation are positively associated with T2D when the entire cohort is analyzed together. However, SIRE and GA both show negative interaction effects with SDI on T2D. Higher levels of SDI are negatively associated with T2D in the Black and Hispanic groups, and higher levels of SDI are negatively associated with T2D at high levels of African and Native American ancestry.

Conclusion

Socioeconomic deprivation is positively associated with the SIRE group T2D disparities observed here but negatively associated with T2D within the Black and Hispanic groups that show the highest T2D prevalence. These results are paradoxical and have not been reported elsewhere. We discuss possible explanations for this paradox related to the nature of the All of Us data along with SIRE group differences in access to healthcare, diet, and lifestyle.

Stanke Z, Spouge JL. Estimating age-stratified transmission and reproduction numbers during the early exponential phase of an epidemic: A case study with COVID-19 data. Epidemics 2023;44:100714. [PMID: 37595401 PMCID: PMC10528737 DOI: 10.1016/j.epidem.2023.100714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2023] [Revised: 06/07/2023] [Accepted: 08/08/2023] [Indexed: 08/20/2023] Open

Abstract

In a pending pandemic, early knowledge of age-specific disease parameters, e.g., susceptibility, infectivity, and the clinical fraction (the fraction of infections coming to clinical attention), supports targeted public health responses like school closures or sequestration of the elderly. The earlier the knowledge, the more useful it is, so the present article examines an early phase of many epidemics, exponential growth. Using age-stratified COVID-19 case counts collected in Canada, China, Israel, Italy, the Netherlands, and the United Kingdom before April 23, 2020, we present a linear analysis of the exponential phase that attempts to estimate the age-specific disease parameters given above. Some combinations of the parameters can be estimated by requiring that they change smoothly with age. The estimation yielded: (1) the case susceptibility, defined for each age-group as the product of susceptibility to infection and the clinical fraction; (2) the mean number of transmissions of infection per contact within each age-group; and (3) the reproduction number of infection within each age-group, i.e., the diagonal of the age-stratified next-generation matrix. Our restriction to data from the exponential phase indicates the combinations of epidemic parameters that are intrinsically easiest to estimate with early age-stratified case counts. For example, conclusions concerning the age-dependence of case susceptibility appeared more robust than corresponding conclusions about infectivity. Generally, the analysis produced some results consistent with conclusions confirmed much later in the COVID-19 pandemic. Notably, our analysis showed that in some countries, the reproduction number of infection within the half-decade 70-75 was unusually large compared to other half-decades. Our analysis therefore could have anticipated that without countermeasures, COVID-19 would spread rapidly once seeded in homes for the elderly.
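As a rough illustration of working with the exponential phase, the sketch below fits a growth rate by least squares on the logarithm of case counts. It is not the article's linear analysis, which the abstract does not spell out; the function name `exponential_growth_rate` and the synthetic counts are assumptions made for illustration only.

```python
import math


def exponential_growth_rate(days, counts):
    """Least-squares slope of log(count) versus time.

    Assumes every count is positive and that the data come from a
    purely exponential phase (an assumption, not a checked property).
    """
    logs = [math.log(c) for c in counts]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in days)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    return sxy / sxx


# Synthetic counts that double every 3 days (made-up data, not from the article).
days = [0, 3, 6, 9, 12]
counts = [10 * 2 ** (t / 3) for t in days]
r = exponential_growth_rate(days, counts)  # close to ln(2) / 3 per day
```

Applying the same fit separately to each age group's counts would give one growth rate per group, which is the kind of age-stratified input the abstract describes.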

Frith MC, Shaw J, Spouge JL. How to optimally sample a sequence for rapid analysis. Bioinformatics 2023;39:7005197. [PMID: 36702468 PMCID: PMC9907223 DOI: 10.1093/bioinformatics/btad057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 01/24/2023] [Indexed: 01/28/2023] Open

Spouge JL. A closed formula relevant to 'Theory of local k-mer selection with applications to long-read alignment' by Jim Shaw and Yun William Yu. Bioinformatics 2022;38:4848-4849. [PMID: 36063041 PMCID: PMC9801975 DOI: 10.1093/bioinformatics/btac604] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/08/2021] [Revised: 07/11/2022] [Accepted: 09/01/2022] [Indexed: 01/05/2023] Open

Spouge JL. A comprehensive estimation of country-level basic reproduction numbers R0 for COVID-19: Regime regression can automatically estimate the end of the exponential phase in epidemic data. PLoS One 2021;16:e0254145. [PMID: 34255772 PMCID: PMC8277067 DOI: 10.1371/journal.pone.0254145] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/23/2021] [Accepted: 06/18/2021] [Indexed: 12/30/2022] Open

Abstract

In a compartmental epidemic model, the initial exponential phase reflects a fixed interaction between an infectious agent and a susceptible population in steady state, so it determines the basic reproduction number R₀ on its own. After the exponential phase, dynamic complexities like societal responses muddy the practical interpretation of many estimated parameters. The computer program ARRP, already available from sequence alignment applications, automatically estimated the end of the exponential phase in COVID-19 and extracted the exponential growth rate r for 160 countries. By positing a gamma-distributed generation time, the exponential growth method then yielded R₀ estimates for COVID-19 in 160 countries. The use of ARRP ensured that the R₀ estimates were largely freed from any dependency outside the exponential phase. The Prem matrices quantify rates of effective contact for infectious disease. Without using any age-stratified COVID-19 data, but under strong assumptions about the homogeneity of susceptibility, infectiousness, etc., across different age-groups, the Prem contact matrices also yielded theoretical R₀ estimates for COVID-19 in 152 countries, generally in quantitative conflict with the R₀ estimates derived from the exponential growth method. An exploratory analysis manipulating only the Prem contact matrices reduced the conflict, suggesting that age-groups under 20 years did not promote the initial exponential growth of COVID-19 as much as other age-groups. The analysis therefore supports tentatively and tardily, but independently of age-stratified COVID-19 data, the low priority given to vaccinating younger age groups. It also supports the judicious reopening of schools. The exploratory analysis also supports the possibility of suspecting differences in epidemic spread among different age-groups, even before substantial amounts of age-stratified data become available.
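The exponential growth method mentioned above can be sketched as follows. For a gamma-distributed generation time with shape k and scale θ, the standard relation R₀ = 1/M(−r), where M is the moment generating function of the generation time, reduces to R₀ = (1 + rθ)^k. The function name and the parameterization by mean and standard deviation are assumptions for illustration; the generation-time parameters used in the article are not given in the abstract.

```python
def r0_from_growth_rate(r, mean_gt, sd_gt):
    """R0 from exponential growth rate r under a gamma generation time.

    mean_gt and sd_gt are the mean and standard deviation of the
    generation time (hypothetical inputs, in the same time unit as 1/r).
    """
    shape = (mean_gt / sd_gt) ** 2      # gamma shape k
    scale = sd_gt ** 2 / mean_gt        # gamma scale theta
    base = 1.0 + r * scale
    if base <= 0.0:
        raise ValueError("growth rate too negative for this generation time")
    return base ** shape


# With sd equal to the mean, the gamma reduces to an exponential
# distribution and the formula becomes R0 = 1 + r * mean.
example = r0_from_growth_rate(0.1, 5.0, 5.0)
```

A growth rate of zero gives R₀ = 1 under any generation-time parameters, which is a quick sanity check on the formula.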

Martín MP, Daniëls PP, Erickson D, Spouge JL. Correction: Figures of merit and statistics for detecting faulty species identification with DNA barcodes: A case study in Ramaria and related fungal genera. PLoS One 2021;16:e0250030. [PMID: 33826666 PMCID: PMC8026041 DOI: 10.1371/journal.pone.0250030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open

Melzak KA, Spouge JL, Boecker C, Kirschhöfer F, Brenner-Weiss G, Bieback K. Hemolysis Pathways during Storage of Erythrocytes and Inter-Donor Variability in Erythrocyte Morphology. Transfus Med Hemother 2021;48:39-47. [PMID: 33708051 DOI: 10.1159/000508711] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/18/2020] [Accepted: 05/03/2020] [Indexed: 01/10/2023] Open

Spouge JL, Ziegelbauer JM, Gonzalez M. A linear-time algorithm that avoids inverses and computes Jackknife (leave-one-out) products like convolutions or other operators in commutative semigroups. Algorithms Mol Biol 2020;15:17. [PMID: 32968428 PMCID: PMC7502207 DOI: 10.1186/s13015-020-00178-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2020] [Accepted: 09/08/2020] [Indexed: 11/10/2022] Open
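To make the title concrete, the sketch below computes every leave-one-out product with a linear number of operations and no inverses, using the well-known prefix/suffix technique. This is not necessarily the algorithm of the article, whose details are not given in this listing; the function name `leave_one_out` is hypothetical. The operation only needs to be associative (commutativity is not required by this particular technique).

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def leave_one_out(items: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return, for each i, the op-product of all items except items[i].

    Uses about 3n calls to op and never needs an inverse element.
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items")
    # prefix[i] = items[0] op ... op items[i]
    prefix = [items[0]] * n
    for i in range(1, n):
        prefix[i] = op(prefix[i - 1], items[i])
    # suffix[i] = items[i] op ... op items[n-1]
    suffix = [items[-1]] * n
    for i in range(n - 2, -1, -1):
        suffix[i] = op(items[i], suffix[i + 1])
    out = [suffix[1]]
    for i in range(1, n - 1):
        out.append(op(prefix[i - 1], suffix[i + 1]))
    out.append(prefix[n - 2])
    return out


products = leave_one_out([2, 3, 5, 7], lambda a, b: a * b)
maxima = leave_one_out([4, 1, 9], max)  # max has no inverse operation
```

Dividing the full product by each item would fail for `max`, or for a product containing zero, which is why avoiding inverses matters.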

Martín MP, Daniëls PP, Erickson D, Spouge JL. Figures of merit and statistics for detecting faulty species identification with DNA barcodes: A case study in Ramaria and related fungal genera. PLoS One 2020;15:e0237507. [PMID: 32813726 PMCID: PMC7437900 DOI: 10.1371/journal.pone.0237507] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/05/2019] [Accepted: 07/28/2020] [Indexed: 11/19/2022] Open

Abstract

DNA barcoding can identify biological species and provides an important tool in diverse applications, such as conserving species and identifying pathogens, among many others. If combined with statistical tests, DNA barcoding can focus taxonomic scrutiny onto anomalous species identifications based on morphological features. Accordingly, we put nonparametric tests into a taxonomic context to answer questions about our sequence dataset of the formal fungal barcode, the nuclear ribosomal internal transcribed spacer (ITS). For example, does DNA barcoding concur with annotated species identifications significantly better if expert taxonomists produced the annotations? Does species assignment improve significantly if sequences are restricted to lengths greater than 500 bp? Both questions require a figure of merit to measure the accuracy of species identification, typically provided by the probability of correct identification (PCI). Many articles on DNA barcoding use variants of PCI to measure the accuracy of species identification, but do not provide the variants with names, and the absence of explicit names hinders the recognition that the different variants are not comparable from study to study. We provide four variant PCIs with a name and show that for fixed data they follow systematic inequalities. Despite custom, therefore, their comparison is at a minimum problematic. Some popular PCI variants are particularly vulnerable to errors in species annotation, insensitive to improvements in a barcoding pipeline, and unable to predict identification accuracy as a database grows, making them unsuitable for many purposes. Generally, the Fractional PCI has the best properties as a figure of merit for species identification. The fungal genus Ramaria provides unusual taxonomic difficulties. As a case study, it shows that a good taxonomic background can be combined with the pertinent summary statistics of molecular results to improve the identification of doubtful samples, linking both disciplines synergistically.

Patel V, Spouge JL. Estimating the basic reproduction number of a pathogen in a single host when only a single founder successfully infects. PLoS One 2020;15:e0227127. [PMID: 31923263 PMCID: PMC6953795 DOI: 10.1371/journal.pone.0227127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2019] [Accepted: 12/12/2019] [Indexed: 11/27/2022] Open

Carroll HD, Spouge JL, Gonzalez M. MultiDomainBenchmark: a multi-domain query and subject database suite. BMC Bioinformatics 2019;20:77. [PMID: 30764761 PMCID: PMC6376684 DOI: 10.1186/s12859-019-2660-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/14/2018] [Accepted: 01/28/2019] [Indexed: 11/10/2022] Open

Manzourolajdad A, Spouge JL. Structural prediction of RNA switches using conditional base-pair probabilities. PLoS One 2019;14:e0217625. [PMID: 31188853 PMCID: PMC6561571 DOI: 10.1371/journal.pone.0217625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/10/2018] [Accepted: 05/15/2019] [Indexed: 11/23/2022] Open

Spouge JL. An accurate approximation for the expected site frequency spectrum in a Galton-Watson process under an infinite sites mutation model. Theor Popul Biol 2019;127:7-15. [PMID: 30876864 DOI: 10.1016/j.tpb.2019.03.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/24/2018] [Revised: 03/04/2019] [Accepted: 03/05/2019] [Indexed: 01/26/2023]

Abstract

If viruses or other pathogens infect a single host, the outcome of infection often hinges on the fate of the initial invaders. The initial basic reproduction number R₀, the expected number of cells infected by a single infected cell, helps determine whether the initial viruses can establish a successful beachhead. To determine R₀, the Kingman coalescent or continuous-time birth-and-death process can be used to infer the rate of exponential growth in an historical population. Given M sequences sampled in the present, the two models can make the inference from the site frequency spectrum (SFS), the count of mutations that appear in exactly k sequences (k=1,2,…,M). In the case of viruses, however, if R₀ is large and an infected cell bursts while propagating virus, the two models are suspect, because they are Markovian with only binary branching. Accordingly, this article develops an approximation for the SFS of a discrete-time branching process with synchronous generations (i.e., a Galton-Watson process). When evaluated in simulations with an asynchronous, non-Markovian model (a Bellman-Harris process) with parameters intended to mimic the bursting viral reproduction of HIV, the approximation proved superior to approximations derived from the Kingman coalescent or continuous-time birth-and-death process. This article demonstrates that in analogy to methods in human genetics, the SFS of viral sequences sampled well after latent infection can remain informative about the initial R₀. Thus, it suggests the utility of analyzing the SFS of sequences derived from patient and animal trials of viral therapies, because in some cases, the initial R₀ may be able to indicate subtle therapeutic progress, even in the absence of statistically significant differences in the infection of treatment and control groups.
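The definition of the site frequency spectrum given above (the count of mutations that appear in exactly k of the M sampled sequences) can be illustrated with a short sketch. The 0/1 encoding of sites, where 1 marks the mutant allele, and the function name are assumptions for illustration; the toy alignment is made up.

```python
def site_frequency_spectrum(alignment):
    """Return [s_1, ..., s_M], where s_k counts sites mutated in exactly k sequences.

    alignment: list of M equal-length rows of 0/1, with 1 marking the
    mutant (derived) allele at that site. This encoding is assumed here.
    """
    m = len(alignment)
    counts = [0] * (m + 1)
    for column in zip(*alignment):   # iterate over sites
        counts[sum(column)] += 1     # number of sequences carrying the mutation
    return counts[1:]                # drop k = 0 (unmutated sites)


# Toy alignment: 3 sequences, 4 sites (made-up data).
toy = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
sfs = site_frequency_spectrum(toy)
```

In the toy data, one site is mutated in a single sequence and two sites are mutated in two sequences, so the spectrum is [1, 2, 0].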

Tang K, Ren J, Cronn R, Erickson DL, Milligan BG, Parker-Forney M, Spouge JL, Sun F. Alignment-free genome comparison enables accurate geographic sourcing of white oak DNA. BMC Genomics 2018;19:896. [PMID: 30526482 PMCID: PMC6288960 DOI: 10.1186/s12864-018-5253-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/24/2018] [Accepted: 11/15/2018] [Indexed: 01/14/2023] Open

Gauran IIM, Park J, Lim J, Park D, Zylstra J, Peterson T, Kann M, Spouge JL. Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data. Biometrics 2017;74:458-471. [PMID: 28940296 DOI: 10.1111/biom.12779] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 10/01/2016] [Revised: 08/01/2017] [Accepted: 08/01/2017] [Indexed: 11/28/2022]

Gonzalez M, DeVico AL, Spouge JL. Conserved signatures indicate HIV-1 transmission is under strong selection and thus is not a "stochastic" process. Retrovirology 2017;14:13. [PMID: 28231858 PMCID: PMC5324211 DOI: 10.1186/s12977-016-0326-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/17/2016] [Accepted: 12/22/2016] [Indexed: 11/23/2022] Open

Acevedo-Luna N, Mariño-Ramírez L, Halbert A, Hansen U, Landsman D, Spouge JL. Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules. BMC Bioinformatics 2016;17:479. [PMID: 27871221 PMCID: PMC5117513 DOI: 10.1186/s12859-016-1354-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Accepted: 11/11/2016] [Indexed: 11/24/2022] Open

Abstract

Background

Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA, to control specific sets of genes. Some transcription factor binding sites (TFBSs) near the transcription start site (TSS) display tight positional preferences relative to the TSS. Furthermore, near the TSS, RMs can co-localize TFBSs with each other and the TSS. The proportion of TFBS positional preferences due to TFBS co-localization within RMs is unknown, however. ChIP experiments confirm co-localization of some TFBSs genome-wide, including near the TSS, but they typically examine only a few TFs at a time, using non-physiological conditions that can vary from lab to lab. In contrast, sequence analysis can examine many TFs uniformly and methodically, broadly surveying the co-localization of TFBSs with tight positional preferences relative to the TSS.

Results

Our statistics found 43 significant sets of human motifs in the JASPAR TF Database with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a gene group of 135 to 3304 genes, with 42/43 (98%) gene groups independently validated by DAVID, a gene ontology database, with FDR < 0.05. Motifs corresponding to two TFBSs in an RM should co-occur more than by chance alone, enriching the intersection of the gene groups corresponding to the two TFs. Thus, a gene-group intersection systematically enriched beyond chance alone provides evidence that the two TFs participate in an RM. Of the 903 = 43*42/2 intersections of the 43 significant gene groups, we found 768/903 (85%) pairs of gene groups with significantly enriched intersections, with 564/768 (73%) intersections independently validated by DAVID with FDR < 0.05. A user-friendly web site at http://go.usa.gov/3kjsH permits biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs.

Conclusions

Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM near the TSS that binds a particular TF subunit. Of all intersections of our 43 significant gene groups, 85% were significantly enriched, with 73% of the significant enrichments independently validated by gene ontology. The co-localization of TFBSs within RMs therefore likely explains much of the tight TFBS positional preferences near the TSS.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-1354-5) contains supplementary material, which is available to authorized users.
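The pair count 903 = 43*42/2 and the idea of an intersection "enriched beyond chance alone" can be sketched with a hypergeometric tail probability. The abstract does not state which test the authors used, so the hypergeometric model below is an assumption, as are the function name and the small numbers in the example.

```python
from math import comb


def intersection_enrichment_pvalue(n_genes, size_a, size_b, overlap):
    """P(X >= overlap) when group B is drawn at random from n_genes genes.

    X is the number of B's genes that fall in group A
    (hypergeometric model; assumed here, not taken from the article).
    """
    upper = min(size_a, size_b)
    total = comb(n_genes, size_b)
    tail = sum(
        comb(size_a, k) * comb(n_genes - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


n_pairs = comb(43, 2)  # unordered pairs among 43 gene groups
# Tiny made-up example: 10 genes, two groups of 3 that coincide exactly.
p_small = intersection_enrichment_pvalue(10, 3, 3, 3)
```

Because 903 pairs are tested at once, any such p-values would need a multiple-testing correction; the abstract reports validation at FDR < 0.05.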

Manzourolajdad A, Gonzalez M, Spouge JL. Changes in the Plasticity of HIV-1 Nef RNA during the Evolution of the North American Epidemic. PLoS One 2016;11:e0163688. [PMID: 27685447 PMCID: PMC5042412 DOI: 10.1371/journal.pone.0163688] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/21/2016] [Accepted: 09/13/2016] [Indexed: 02/04/2023] Open

Spouge JL. Finite-size corrections to Poisson approximations of rare events in renewal processes. J Appl Probab 2016. [DOI: 10.1239/jap/996986762] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]

Spouge JL. Path reversal, islands, and the gapped alignment of random sequences. J Appl Probab 2016. [DOI: 10.1239/jap/1101840544] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]

Spouge JL. A branching-process solution of the polydisperse coagulation equation. Adv Appl Probab 2016. [DOI: 10.2307/1427224] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]

Sheetlin S, Park Y, Frith MC, Spouge JL. ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 2015;32:304-5. [PMID: 26428291 DOI: 10.1093/bioinformatics/btv575] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/21/2015] [Accepted: 09/28/2015] [Indexed: 11/13/2022] Open

Tewari S, Spouge JL. Coalescent: an open-science framework for importance sampling in coalescent theory. PeerJ 2015;3:e1203. [PMID: 26312189 PMCID: PMC4548476 DOI: 10.7717/peerj.1203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2014] [Accepted: 07/30/2015] [Indexed: 11/20/2022] Open

Abstract

Background

In coalescent theory, computer programs often use importance sampling to calculate likelihoods and other statistical quantities. An importance sampling scheme can exploit human intuition to improve statistical efficiency of computations, but unfortunately, in the absence of general computer frameworks on importance sampling, researchers often struggle to translate new sampling schemes computationally or benchmark against different schemes, in a manner that is reliable and maintainable. Moreover, most studies use computer programs lacking a convenient user interface or the flexibility to meet the current demands of open science. In particular, current computer frameworks can only evaluate the efficiency of a single importance sampling scheme or compare the efficiencies of different schemes in an ad hoc manner.

Results

We have designed a general framework (http://coalescent.sourceforge.net; language: Java; License: GPLv3) for importance sampling that computes likelihoods under the standard neutral coalescent model of a single, well-mixed population of constant size over time following the infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. For a given dataset, it computes the likelihood and provides the maximum likelihood estimate of the mutation parameter. Well-known benchmarks in the coalescent literature validate the accuracy of the framework. The framework provides an intuitive user interface with minimal clutter. For performance, the framework switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework reliable and maintainable.

Conclusions

In coalescent theory, many studies of computational efficiency consider only effective sample size. Here, we evaluate proposals in the coalescent literature, to discover that the order of efficiency among the three importance sampling schemes changes when one considers running time as well as effective sample size. We also describe a computational technique called "just-in-time delegation" available to improve the trade-off between running time and precision by constructing improved importance sampling schemes from existing ones. Thus, our systems approach is a potential solution to the "2^8 programs problem" highlighted by Felsenstein, because it provides the flexibility to include or exclude various features of similar coalescent models or importance sampling schemes.
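The effective sample size mentioned above is commonly computed from importance weights as (Σw)² / Σw². The sketch below uses that common definition, which may differ from the one in the article, together with the usual importance sampling estimate (the mean of the weights). Both function names and the example weights are assumptions for illustration, and the code is independent of the Java framework described in the abstract.

```python
def importance_estimate(weights):
    """Importance sampling estimate of a likelihood: the mean weight.

    Each weight is assumed to be target density / proposal density
    for one sampled history (a standard setup, assumed here).
    """
    return sum(weights) / len(weights)


def effective_sample_size(weights):
    """Common ESS definition: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


even = effective_sample_size([1.0, 1.0, 1.0, 1.0])    # all samples contribute
skewed = effective_sample_size([1.0, 0.0, 0.0, 0.0])  # one sample dominates
```

Equal weights give an ESS equal to the number of samples, while a single dominant weight gives an ESS near 1. As the abstract notes, ranking schemes by ESS alone ignores running time, so ESS per unit time is one natural way to combine the two.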

Carroll HD, Williams AC, Davis AG, Spouge JL. Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate. IEEE/ACM Trans Comput Biol Bioinform 2015;12:531-537. [PMID: 26357264 PMCID: PMC4568567 DOI: 10.1109/tcbb.2014.2366112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]

Silva JC, Egan A, Arze C, Spouge JL, Harris DG. A new method for estimating species age supports the coexistence of malaria parasites and their Mammalian hosts. Mol Biol Evol 2015;32:1354-64. [PMID: 25589738 PMCID: PMC4408405 DOI: 10.1093/molbev/msv005] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Indexed: 12/16/2022] Open

Abstract

Species in the genus Plasmodium cause malaria in humans and infect a variety of mammals and other vertebrates. Currently, estimated ages for several mammalian Plasmodium parasites differ by as much as one order of magnitude, an inaccuracy that frustrates reliable estimation of evolutionary rates of disease-related traits. We developed a novel statistical approach to dating the relative age of evolutionary lineages, based on Total Least Squares regression. We validated this lineage dating approach by applying it to the genus Drosophila. Using data from the Drosophila 12 Genomes project, our approach accurately reconstructs the age of well-established Drosophila clades, including the speciation event that led to the subgenera Drosophila and Sophophora, and the age of the melanogaster species subgroup. We applied this approach to hundreds of loci from seven mammalian Plasmodium species. We demonstrate the existence of a molecular clock specific to individual Plasmodium proteins, and estimate the relative age of mammalian-infecting Plasmodium. These analyses indicate that: 1) the split between the human parasite Plasmodium vivax and P. knowlesi, from Old World monkeys, occurred 6.1 times earlier than that between P. falciparum and P. reichenowi, parasites of humans and chimpanzees, respectively; and 2) mammalian Plasmodium parasites originated 22 times earlier than the split between P. falciparum and P. reichenowi. Calibrating the absolute divergence times for Plasmodium with eukaryotic substitution rates, we show that the split between P. falciparum and P. reichenowi occurred 3.0-5.5 Ma, and that mammalian Plasmodium parasites originated over 64 Ma. Our results indicate that mammalian-infecting Plasmodium evolved contemporaneously with their hosts, with little evidence for parasite host-switching on an evolutionary scale, and provide a solid timeframe within which to place the evolution of new Plasmodium species.
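
As background for the Total Least Squares regression mentioned in this abstract, the following minimal sketch fits a straight line by minimizing squared perpendicular distances, using the standard closed-form slope for orthogonal regression with equal error variances in both coordinates. It is a generic illustration; the abstract does not give the article's exact formulation, and the function name and toy data are assumptions.

```python
import math


def tls_line(xs, ys):
    """Fit y = intercept + slope * x by Total Least Squares (orthogonal regression).

    Assumes equal error variance in x and y. Returns (slope, intercept).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    if sxy == 0.0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    # Closed-form slope of the direction that minimizes perpendicular distances.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (illustrative only).
slope, intercept = tls_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

Unlike ordinary least squares, this fit treats both coordinates as noisy, which is why it is a natural choice when both variables are estimated quantities.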

Sheetlin SL, Park Y, Frith MC, Spouge JL. Frameshift alignment: statistics and post-genomic applications. Bioinformatics 2014;30:3575-82. [PMID: 25172925 DOI: 10.1093/bioinformatics/btu576] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]

Spouge JL, Mariño-Ramírez L, Sheetlin SL. Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps. Int J Bioinform Res Appl 2014;10:384-408. [PMID: 24989859 DOI: 10.1504/ijbra.2014.062991] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]

Mandoiu I, Pop M, Rajasekaran S, Spouge JL. This special issue includes a selection of papers presented at the 2nd IEEE International Conference. Introduction. Int J Bioinform Res Appl 2014;10:341-344. [PMID: 25715438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]

Spouge JL. Within a sample from a population, the distribution of the number of descendants of a subsample's most recent common ancestor. Theor Popul Biol 2013;92:51-4. [PMID: 24321308 DOI: 10.1016/j.tpb.2013.11.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2013] [Revised: 11/23/2013] [Accepted: 11/26/2013] [Indexed: 10/25/2022]

Gonzalez MW, Spouge JL. Domain analysis of symbionts and hosts (DASH) in a genome-wide survey of pathogenic human viruses. BMC Res Notes 2013;6:209. [PMID: 23706066 PMCID: PMC3672079 DOI: 10.1186/1756-0500-6-209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2012] [Accepted: 05/17/2013] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet pairwise-sequence searches have limited sensitivity, resulting in poor identification of divergent homologies.

RESULTS

Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. The order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively).

CONCLUSIONS

We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species, thereby mining functional information to identify domains shared by both organisms as well as domains exclusive to each. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at http://tinyurl.com/spouge-dash.

Suwannasai N, Martín MP, Phosri C, Sihanonth P, Whalley AJS, Spouge JL. Fungi in Thailand: a case study of the efficacy of an ITS barcode for automatically identifying species within the Annulohypoxylon and Hypoxylon genera. PLoS One 2013;8:e54529. [PMID: 23390499 PMCID: PMC3563529 DOI: 10.1371/journal.pone.0054529] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2012] [Accepted: 12/13/2012] [Indexed: 11/20/2022] Open

Abstract

Thailand, a part of the Indo-Burma biodiversity hotspot, has many endemic animals and plants. Some of its fungal species are difficult to recognize and separate, complicating assessments of biodiversity. We assessed species diversity within the fungal genera Annulohypoxylon and Hypoxylon, which produce biologically active and potentially therapeutic compounds, by applying classical taxonomic methods to 552 teleomorphs collected from across Thailand. Using probability of correct identification (PCI), we also assessed the efficacy of automated species identification with a fungal barcode marker, ITS, in the model system of Annulohypoxylon and Hypoxylon. The 552 teleomorphs yielded 137 ITS sequences; in addition, we examined 128 GenBank ITS sequences, to assess biases in evaluating a DNA barcode with GenBank data. The use of multiple sequence alignment in a barcode database like BOLD raises some concerns about non-protein barcode markers like ITS, so we also compared species identification using different alignment methods. Our results suggest the following. (1) Multiple sequence alignment of ITS sequences is competitive with pairwise alignment when identifying species, so BOLD should be able to preserve its present bioinformatics workflow for species identification for ITS, and possibly therefore with at least some other non-protein barcode markers. (2) Automated species identification is insensitive to a specific choice of evolutionary distance, contributing to resolution of a current debate in DNA barcoding. (3) Statistical methods are available to address, at least partially, the possibility of expert misidentification of species. Phylogenetic trees discovered a cryptic species and strongly supported monophyletic clades for many Annulohypoxylon and Hypoxylon species, suggesting that ITS can contribute usefully to a barcode for these fungi. 
However, the PCIs here, derived solely from ITS, suggest that a fungal barcode will require secondary markers in Annulohypoxylon and Hypoxylon. The URL http://tinyurl.com/spouge-barcode contains computer programs and other supplementary material relevant to this article.
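
To make the idea of a probability of correct identification (PCI) concrete, here is a minimal sketch of one simple variant: a leave-one-out, nearest-neighbour test on pre-aligned sequences using the p-distance (fraction of differing sites). This is an assumption for illustration; the abstract does not define PCI precisely, and the article's own programs (at the URL above) may compute it differently. All names and the toy records are hypothetical.

```python
from collections import defaultdict


def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leave_one_out_pci(records):
    """records: list of (species, aligned_sequence) pairs.

    A query counts as correctly identified only if every nearest neighbour
    (ties included) among the other records belongs to the query's species.
    Returns (per-species PCI, unweighted mean over species).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for i, (species, seq) in enumerate(records):
        others = [(p_distance(seq, s), sp)
                  for j, (sp, s) in enumerate(records) if j != i]
        best = min(d for d, _ in others)
        nearest_species = {sp for d, sp in others if d == best}
        totals[species] += 1
        if nearest_species == {species}:
            hits[species] += 1
    per_species = {sp: hits[sp] / totals[sp] for sp in totals}
    return per_species, sum(per_species.values()) / len(per_species)


# Toy records (illustrative only, not real ITS data).
records = [("A", "AAAA"), ("A", "AAAT"),
           ("B", "CCCC"), ("B", "CCCG"), ("B", "AATT")]
print(leave_one_out_pci(records))
```

Averaging over species rather than over sequences keeps heavily sampled species from dominating the summary, which is one reasonable design choice among several.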

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, Fiore-Donno AM, Gile GH, Holzmann M, Jahn R, Jirků M, Keeling PJ, Kostka M, Kudryavtsev A, Lara E, Lukeš J, Mann DG, Mitchell EAD, Nitsche F, Romeralo M, Saunders GW, Simpson AGB, Smirnov AV, Spouge JL, Stern RF, Stoeck T, Zimmermann J, Schindel D, de Vargas C. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 2012;10:e1001419. [PMID: 23139639 PMCID: PMC3491025 DOI: 10.1371/journal.pbio.1001419] [Citation(s) in RCA: 319] [Impact Index Per Article: 26.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Affiliation(s)

Jan Pawlowski Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
Stéphane Audic Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7144 and Université Pierre et Marie Curie, Paris 6, Station Biologique de Roscoff, France
Sina Adl Department of Soil Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
David Bass Department of Life Sciences, Natural History Museum, London, United Kingdom
Lassaâd Belbahri Laboratory of Soil Biology, University of Neuchâtel, Neuchâtel, Switzerland
Cédric Berney Department of Life Sciences, Natural History Museum, London, United Kingdom
Samuel S. Bowser Wadsworth Center, New York State Department of Health, Albany, New York, United States of America
Ivan Cepicka Department of Zoology, Charles University in Prague, Prague, Czech Republic
Johan Decelle Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7144 and Université Pierre et Marie Curie, Paris 6, Station Biologique de Roscoff, France
Micah Dunthorn Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
Anna Maria Fiore-Donno Institute of Botany and Landscape Ecology, University of Greifswald, Greifswald, Germany
Gillian H. Gile Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
Maria Holzmann Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
Regine Jahn Botanischer Garten und Botanischer Museum Berlin-Dahlem, Freie Universität Berlin, Berlin, Germany
Miloslav Jirků Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
Patrick J. Keeling Canadian Institute for Advanced Research, Botany Department, University of British Columbia, Vancouver, British Columbia, Canada
Martin Kostka Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
Alexander Kudryavtsev Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland Department of Invertebrate Zoology, St-Petersburg State University, St-Petersburg, Russia
Enrique Lara Laboratory of Soil Biology, University of Neuchâtel, Neuchâtel, Switzerland
Julius Lukeš Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
David G. Mann Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
Edward A. D. Mitchell Laboratory of Soil Biology, University of Neuchâtel, Neuchâtel, Switzerland
Frank Nitsche Allgemeine Ökologie, Universität zu Köln, Köln, Germany
Maria Romeralo Department of Systematic Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
Gary W. Saunders Department of Biology, University of New Brunswick, Fredericton, New Brunswick, Canada
Alastair G. B. Simpson Department of Biology, Life Sciences Centre, Halifax, Nova Scotia, Canada
Alexey V. Smirnov Department of Invertebrate Zoology, St-Petersburg State University, St-Petersburg, Russia
John L. Spouge National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Computational Biology Branch, Bethesda, Maryland, United States of America
Rowena F. Stern Sir Alister Hardy Foundation for Ocean Science, Citadel Hill, Plymouth, United Kingdom
Thorsten Stoeck Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
Jonas Zimmermann Botanischer Garten und Botanischer Museum Berlin-Dahlem, Freie Universität Berlin, Berlin, Germany Justus-Liebig-University, Giessen, Germany
David Schindel Smithsonian Institution, National Museum of Natural History, Washington, DC, United States of America
Colomban de Vargas Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7144 and Université Pierre et Marie Curie, Paris 6, Station Biologique de Roscoff, France

Tewari S, Spouge JL. Coalescent: an open-source and scalable framework for exact calculations in coalescent theory. BMC Bioinformatics 2012;13:257. [PMID: 23033878 PMCID: PMC3575375 DOI: 10.1186/1471-2105-13-257] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2011] [Accepted: 10/02/2012] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Currently, there is no open-source, cross-platform and scalable framework for coalescent analysis in population genetics. There is no scalable GUI-based user application either. Such a framework and application would not only drive the creation of more complex and realistic models but also make them truly accessible.

RESULTS

As a first attempt, we built a framework and user application for the domain of exact calculations in coalescent analysis. The framework provides an API with the concepts of model, data, statistic, phylogeny, gene tree and recursion. Infinite-alleles and infinite-sites models are considered. It defines pluggable computations such as counting and listing all the ancestral configurations and genealogies and computing the exact probability of data. It can visualize a gene tree, trace and visualize the internals of the recursion algorithm for further improvement, and dynamically attach a number of output processors. The user application defines jobs in a plug-in-like manner so that they can be activated, deactivated, installed or uninstalled on demand. Multiple jobs can be run and their inputs edited. Job inputs persist across restarts, and running jobs can be cancelled where applicable.

CONCLUSIONS

Coalescent theory plays an increasingly important role in analysing molecular population genetic data. The models involved are mathematically difficult and computationally challenging. An open-source, scalable framework that lets users immediately take advantage of the progress made by others will enable exploration of yet more difficult and realistic models. As models become more complex and mathematically less tractable, the need for an integrated computational approach is obvious. Object-oriented designs, though they have upfront costs, are practical now and can provide such an integrated approach.
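
As a small illustration of what "the exact probability of data" can mean under the infinite-alleles model, the sketch below evaluates the Ewens sampling formula, a well-known closed form for the probability of an allele-frequency configuration. It is not the framework's API and does not reproduce its recursion algorithm, which the abstract does not detail; the function name and argument layout are assumptions.

```python
from math import factorial


def ewens_probability(counts, theta):
    """Ewens sampling formula for the infinite-alleles model.

    counts[j - 1] is the number of allele types observed exactly j times,
    so the sample size is n = sum(j * counts[j - 1]). theta is the scaled
    mutation parameter and must be positive.
    """
    if theta <= 0.0:
        raise ValueError("theta must be positive")
    n = sum(j * a for j, a in enumerate(counts, start=1))
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1.0
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a in enumerate(counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob


# All configurations for a sample of size 3; their probabilities sum to 1.
configs = [(3, 0, 0), (1, 1, 0), (0, 0, 1)]
print([ewens_probability(c, theta=1.0) for c in configs])
```

No comparably simple closed form is generally available for the infinite-sites model, which is one reason recursive computation is needed there.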

Park Y, Sheetlin S, Ma N, Madden TL, Spouge JL. New finite-size correction for local alignment score distributions. BMC Res Notes 2012;5:286. [PMID: 22691307 PMCID: PMC3483159 DOI: 10.1186/1756-0500-5-286] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2012] [Accepted: 05/16/2012] [Indexed: 11/10/2022] Open

Spouge JL, Mariño-Ramírez L. The practical evaluation of DNA barcode efficacy. Methods Mol Biol 2012;858:365-77. [PMID: 22684965 PMCID: PMC3410705 DOI: 10.1007/978-1-61779-591-6_17] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2023]

Sheetlin S, Park Y, Spouge JL. Objective method for estimating asymptotic parameters, with an application to sequence alignment. Phys Rev E Stat Nonlin Soft Matter Phys 2011;84:031914. [PMID: 22060410 PMCID: PMC3233989 DOI: 10.1103/physreve.84.031914] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/02/2011] [Revised: 06/14/2011] [Indexed: 05/31/2023]

Carroll HD, Kann MG, Sheetlin SL, Spouge JL. Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics. Bioinformatics 2010;26:1708-13. [PMID: 20505002 PMCID: PMC2894514 DOI: 10.1093/bioinformatics/btq270] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Park Y, Sheetlin S, Spouge JL. ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES. Ann Stat 2009;37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]

Mariño-Ramírez L, Tharakaraman K, Spouge JL, Landsman D. Promoter analysis: gene regulatory motif identification with A-GLAM. Methods Mol Biol 2009;537:263-76. [PMID: 19378149 DOI: 10.1007/978-1-59745-251-9_13] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]

Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res 2008;36:5863-71. [PMID: 18796526 PMCID: PMC2566872 DOI: 10.1093/nar/gkn579] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Kim NK, Tharakaraman K, Mariño-Ramírez L, Spouge JL. Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 2008;9:262. [PMID: 18533028 PMCID: PMC2432075 DOI: 10.1186/1471-2105-9-262] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2007] [Accepted: 06/04/2008] [Indexed: 12/03/2022] Open

Tharakaraman K, Bodenreider O, Landsman D, Spouge JL, Mariño-Ramírez L. The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Res 2008;36:2777-86. [PMID: 18367472 PMCID: PMC2377430 DOI: 10.1093/nar/gkn137] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open

Abstract

A number of previous studies have predicted transcription factor binding sites (TFBSs) by exploiting the position of genomic landmarks like the transcriptional start site (TSS). The studies’ methods are generally too computationally intensive for genome-scale investigation, so the full potential of ‘positional regulomics’ to discover TFBSs and determine their function remains unknown. Because databases often annotate the genomic landmarks in DNA sequences, the methodical exploitation of positional regulomics has become increasingly urgent. Accordingly, we examined a set of 7914 human putative promoter regions (PPRs) with a known TSS. Our methods identified 1226 eight-letter DNA words with significant positional preferences with respect to the TSS, of which only 608 matched known TFBSs. However, many groups of genes whose PPRs contained a common word displayed similar expression profiles and related biological functions. Most interestingly, our results included 78 words, each of which clustered significantly in two or three different positions relative to the TSS. Often, the gene groups corresponding to different positional clusters of the same word corresponded to diverse functions, e.g. activation or repression in different tissues. Thus, different clusters of the same word likely reflect the phenomenon of ‘positional regulation’, i.e. a word's regulatory function can vary with its position relative to a genomic landmark, a conclusion inaccessible to methods based purely on sequence. Further integrative analysis of words co-occurring in PPRs also yielded 24 different groups of genes, likely identifying cis-regulatory modules de novo. Whereas comparative genomics requires precise sequence alignments, positional regulomics exploits genomic landmarks to provide a ‘poor man's alignment’. By exploiting the phenomenon of positional regulation, it uses position to differentiate the biological functions of subsets of TFBSs sharing a common sequence motif.
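
The idea of anchoring words on a genomic landmark can be sketched as follows. The code records where each k-letter word starts relative to the TSS across a set of promoter sequences, then reports the largest number of occurrences falling within any short positional window. It is only a crude stand-in for the statistical test of positional preference, which the abstract does not specify; the function names, the window measure, and the toy sequences are all assumptions.

```python
from collections import defaultdict


def word_offsets(promoters, tss_index, k=8):
    """Map each k-letter word to the start offsets (relative to the TSS)
    of its occurrences. Assumes every promoter string places the TSS at
    the same index, tss_index."""
    offsets = defaultdict(list)
    for seq in promoters:
        for start in range(len(seq) - k + 1):
            offsets[seq[start:start + k]].append(start - tss_index)
    return offsets


def max_window_count(positions, width):
    """Largest number of positions inside any interval [p, p + width)."""
    pts = sorted(positions)
    best = 0
    left = 0
    for right in range(len(pts)):
        # Shrink the window until it spans fewer than `width` positions.
        while pts[right] - pts[left] >= width:
            left += 1
        best = max(best, right - left + 1)
    return best


# Toy promoters with the TSS assumed at index 5, and 4-letter words for brevity.
promoters = ["AACGTAA", "TACGTTT", "ACGTAAA"]
offsets = word_offsets(promoters, tss_index=5, k=4)
print(sorted(offsets["ACGT"]), max_window_count(offsets["ACGT"], width=2))
```

A word whose occurrences pile up in a narrow window relative to the TSS is a candidate for positional preference; a real analysis would compare the count against a background expectation.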

Kann MG, Sheetlin SL, Park Y, Bryant SH, Spouge JL. The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 2007;35:4678-85. [PMID: 17596268 PMCID: PMC1950549 DOI: 10.1093/nar/gkm414] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Spouge JL. Markov Additive Processes and Repeats in Sequences. J Appl Probab 2007. [DOI: 10.1239/jap/1183667418] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Kim NK, Tharakaraman K, Spouge JL. Adding sequence context to a Markov background model improves the identification of regulatory elements. Bioinformatics 2006;22:2870-5. [PMID: 17068091 DOI: 10.1093/bioinformatics/btl528] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Tharakaraman K, Mariño-Ramírez L, Sheetlin SL, Landsman D, Spouge JL. Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements. BMC Bioinformatics 2006;7:408. [PMID: 16961919 PMCID: PMC1599759 DOI: 10.1186/1471-2105-7-408] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2006] [Accepted: 09/08/2006] [Indexed: 12/05/2022] Open

Abstract

Background

Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set.

Results

We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence.

Conclusion

Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances.
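
The iterative scanning step described in this abstract can be sketched in simplified form. The code below is not A-GLAM: it replaces the E-value test with a plain log-odds score threshold, and it replaces the Bayesian update with pseudocount-smoothed column frequencies. Those substitutions, the uniform background, and all names are assumptions made for illustration. Input sequences are assumed to contain only A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Log-odds PSSM from equal-length sites, with pseudocount smoothing."""
    width = len(sites[0])
    denom = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(width):
        column = {}
        for base in ALPHABET:
            count = sum(site[col] == base for site in sites)
            column[base] = math.log2((count + pseudocount) / denom / background)
        pssm.append(column)
    return pssm


def score(pssm, window):
    """Sum of per-column log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pssm, window))


def scan_to_convergence(sequences, seed_positions, width, threshold, max_rounds=50):
    """Repeatedly scan all windows; add those scoring >= threshold to the
    site set, rebuild the PSSM, and stop when no new window qualifies.

    seed_positions: (sequence_index, start) pairs, e.g. one per sequence
    as a Gibbs sampling step might supply.
    """
    hits = set(seed_positions)
    for _ in range(max_rounds):
        pssm = build_pssm([sequences[i][s:s + width] for i, s in sorted(hits)])
        new_hits = {
            (i, s)
            for i, seq in enumerate(sequences)
            for s in range(len(seq) - width + 1)
            if (i, s) not in hits and score(pssm, seq[s:s + width]) >= threshold
        }
        if not new_hits:
            break  # converged: no further windows contribute to the PSSM
        hits |= new_hits
    return sorted(hits)


# Toy input: the first sequence holds a second copy of the seeded motif.
sequences = ["CCACGTCCACGTCC", "GGACGTGG"]
print(scan_to_convergence(sequences, [(0, 2), (1, 2)], width=4, threshold=3.5))
```

Starting from one seeded site per sequence, the loop picks up the additional instance in the first sequence and then stops, mirroring the convergence criterion in the abstract.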

Tharakaraman K, Mariño-Ramírez L, Sheetlin S, Landsman D, Spouge JL. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics 2005;21 Suppl 1:i440-8. [PMID: 15961489 PMCID: PMC1317086 DOI: 10.1093/bioinformatics/bti1028] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Sheetlin S, Park Y, Spouge JL. The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res 2005;33:4987-94. [PMID: 16147981 PMCID: PMC1199557 DOI: 10.1093/nar/gki800] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open

Park Y, Sheetlin S, Spouge JL. Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment. J Phys A Math Gen 2004. [DOI: 10.1088/0305-4470/38/1/006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]