1. Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' (or Herdan-Heaps') law is a linguistic law stating that vocabulary (dictionary) size, the number of types, grows as a power-law function of the word count, the number of tokens. Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that of a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf's law, are well known, their translation to Heaps' law for DNA words is not automatic. Several other animal genomes are also shown to exhibit range-limited Heaps' law under our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reached the maximum level, a deviation from Heaps' law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigating the type-token plot and its regression coefficients could provide an alternative narrative, from a linguistic perspective, of the reuse and redundancy of protein domains as well as the creation of new protein domains.
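The type-token relationship described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `type_token_curve` and `fit_heaps` and the domain labels are hypothetical, and the fit is a plain least-squares line in log-log space for the form V = K * N**beta.

```python
import math

def type_token_curve(tokens):
    """Return [(tokens_read, distinct_types_seen), ...] for a token stream."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares line through (log N, log V); returns (K, beta) in V = K * N**beta."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return math.exp(mean_y - beta * mean_x), beta

# Hypothetical protein-domain labels, for illustration only.
tokens = ["kinase", "zinc_finger", "kinase", "SH3", "kinase", "zinc_finger", "WD40"]
curve = type_token_curve(tokens)
K, beta = fit_heaps(curve)
print(curve[-1], round(beta, 3))
```

A range-limited law, in this picture, would mean that the straight-line fit is only adequate over part of the log-log curve; a quadratic term in log N is one way to capture the bend that the abstract reports at the largest sample sizes.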
Affiliation(s)
- Wentian Li
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Yannis Almirantis
  - Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
- Astero Provata
  - Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
2. Epley BR. Digest: Few new mutations are recessive lethal. Evolution 2023; 77:1914-1915. [PMID: 37354114 PMCID: PMC10373208 DOI: 10.1093/evolut/qpad117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 06/22/2023] [Indexed: 06/26/2023]
Abstract
When a new mutation arises, what is the probability that it is recessive lethal? Wade et al. find that fewer than 1% of nonsynonymous mutations in humans and Drosophila melanogaster are recessive lethal. The authors show that methods based on site frequency spectrum (SFS) analyses, though generally robust in their estimations of the nonlethal distribution of fitness effects (DFE), are unable to accurately estimate the fraction of recessive lethal mutations.
Affiliation(s)
- Benjamin R Epley
  - Ecology and Evolution, University of Chicago, Chicago, IL, United States
3. Luo J, Wang S, Zhang S, He Y, Li S, Han J, Xu M, Deng G. Performance of ImproGene Cell-Free DNA Tubes for Stabilization and Analysis of cfDNA in Blood Samples. Fetal Pediatr Pathol 2022; 41:771-780. [PMID: 34547970 DOI: 10.1080/15513815.2021.1979143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
BACKGROUND With the development of liquid biopsy technology, the demand for noninvasive prenatal testing (NIPT) is increasing rapidly. The aim of this study was to evaluate the effects of different blood collection tubes on plasma cfDNA and NIPT quality control. METHODS We investigated hemolysis, cfDNA concentration, and fragment distribution in blood samples stored in EDTA, ImproGene, and Streck tubes. The effects of ImproGene and Streck tubes on NIPT quality control were evaluated. RESULTS The ImproGene tubes prevented the time-dependent increase of cfDNA concentration and preserved the cfDNA fragment size distribution. For NIPT quality control, there was no significant difference in cfDNA concentration, library concentration, or fetal fraction between ImproGene and Streck tube samples. GC content of the samples in ImproGene tubes was closer to that of the human genome. CONCLUSION The ImproGene cfDNA tube has excellent performance and is an effective choice for storing blood samples for NIPT or other cfDNA analysis.
Affiliation(s)
- Jianglan Luo
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Sina Wang
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Shu Zhang
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Ye He
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Siyun Li
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Jianhong Han
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Mingfei Xu
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
- Guanhua Deng
  - Enterprise Key Laboratory, Enterprise Key Laboratory for Blood Compatibility of Medical Materials, Guangdong, China
4. Almirantis Y, Provata A, Li W. Noether's Theorem as a Metaphor for Chargaff's 2nd Parity Rule in Genomics. J Mol Evol 2022; 90:231-238. [PMID: 35704064 DOI: 10.1007/s00239-022-10062-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2022] [Accepted: 05/18/2022] [Indexed: 10/18/2022]
Abstract
In the present note, the genomic compositional rule widely known as 'Chargaff's 2nd parity rule' (asserting equimolarity of adenine with thymine and of guanine with cytosine within either of the two DNA strands) is considered in association with Noether's theorem, which links symmetries with conservation laws in physics. In the case of the genome, the strict physical and mathematical prerequisites of Noether's theorem do not hold. However, we conclude that a metaphor can be established with Noether's theorem, as inter-strand symmetry concerning DNA functionality engenders specific features in genome composition. Conversely, when inter-strand symmetry does not hold, the corresponding quantitative relations fail to appear. This association is also considered from the point of view of the existence of emergent laws and properties in evolutionary genomics.
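As a rough illustration of the single-strand parity rule mentioned in this abstract, the sketch below measures how far one strand departs from A = T and G = C. The function name `parity_skews` and the toy sequences are assumptions made for this example, not material from the paper; returning 0.0 when a base pair is absent is also an arbitrary choice.

```python
from collections import Counter

def parity_skews(strand):
    """Return ((A-T)/(A+T), (G-C)/(G+C)) counted on a single strand.

    Values near 0 are consistent with the 2nd parity rule.
    A pair with no occurrences yields 0.0 (an arbitrary convention).
    """
    counts = Counter(strand.upper())

    def skew(x, y):
        total = counts[x] + counts[y]
        return (counts[x] - counts[y]) / total if total else 0.0

    return skew("A", "T"), skew("G", "C")

# Toy sequences for illustration; real tests use long genomic sequences.
print(parity_skews("AATTGGCC"))  # balanced on this strand
print(parity_skews("AAATGC"))    # excess of A over T
```

The rule is statistical and is only expected to hold approximately over long sequences, so short strings like these serve only to show the bookkeeping.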
Affiliation(s)
- Yannis Almirantis
  - Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341, Athens, Greece
- Astero Provata
  - Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341, Athens, Greece
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
5. Li W, Almirantis Y, Provata A. Revisiting the neutral dynamics derived limiting guanine-cytosine content using human de novo point mutation data. Meta Gene 2022. [DOI: 10.1016/j.mgene.2021.100994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
6. Li W, Shih A, Freudenberg-Hua Y, Fury W, Yang Y. Beyond standard pipeline and p < 0.05 in pathway enrichment analyses. Comput Biol Chem 2021; 92:107455. [PMID: 33774420 PMCID: PMC9179938 DOI: 10.1016/j.compbiolchem.2021.107455] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/30/2020] [Revised: 12/18/2020] [Accepted: 02/07/2021] [Indexed: 10/22/2022]
Abstract
A standard pathway/gene-set enrichment analysis, the over-representation analysis, is based on four values: the sizes of the two gene-sets, the size of their overlap, and the size of the gene universe from which the gene-sets are chosen. The standard result of such an analysis is based on the p-value of a statistical test. We supplement this standard pipeline with six cautions: (1) any p-value threshold used to distinguish enriched gene-sets from non-enriched ones is to a certain degree arbitrary; (2) genes in a gene-set may be correlated, potentially overcounting the gene-set size; (3) any attempt to impose multiple testing correction will increase the false negative rate; (4) gene-sets in a gene-set database may be correlated, potentially overcounting the factor for multiple testing correction; (5) the discrete nature of the data makes it possible for a minimal change in counts to produce a quantum change in the conclusion based on a p-value threshold; (6) the two gene-sets may not be chosen from the universe of all human genes, but from a subset of that universe, or even from two different subsets of all genes. Careful reconsideration of these issues can have an impact on an enrichment analysis conclusion. Some of our cautions mirror the call from statisticians that reaching a conclusion from data is not a simple matter of a p-value being smaller than 0.05, but a thoughtful process requiring due diligence.
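The four-value test described in this abstract is commonly computed as a one-sided hypergeometric tail probability. The following sketch assumes that formulation; the function name `ora_p_value` and all counts are made up for illustration and are not taken from the paper.

```python
from math import comb

def ora_p_value(universe, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the overlap obtained when size_b genes are drawn at random
    from a universe containing size_a genes of the first gene-set.
    """
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return tail / total

# Made-up counts: a 20,000-gene universe and gene-sets of 200 and 150 genes.
for k in (4, 5, 6):
    print(k, ora_p_value(20000, 200, 150, k))
```

Printing the p-value for neighbouring overlap counts gives a feel for caution (5): because the overlap is an integer, a change of a single gene can move the p-value across a fixed threshold. Changing the `universe` argument likewise shows how sensitive the result is to the choice of background in caution (6).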
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Andrew Shih
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Yun Freudenberg-Hua
  - Litwin-Zucker Center for the study of Alzheimer's Disease, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA; Division of Geriatric Psychiatry, Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, USA
- Wen Fury
  - Regeneron Pharmaceutical Inc., Tarrytown, NY, USA
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, China
7. Grand Tour Algorithm: Novel Swarm-Based Optimization for High-Dimensional Problems. Processes (Basel) 2020. [DOI: 10.3390/pr8080980] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
Abstract
Agent-based algorithms, based on the collective behavior of natural social groups, exploit innate swarm intelligence to produce metaheuristic methodologies for exploring optimal solutions to diverse processes in systems engineering and other sciences. Especially for complex problems, the processing time and the risk of settling on a locally optimal solution are drawbacks of these algorithms, and to date none has proved its superiority. In this paper, an improved swarm optimization technique named the Grand Tour Algorithm (GTA), based on the behavior of a peloton of cyclists and embodying relevant physical concepts, is introduced and applied to fourteen benchmark optimization problems to evaluate its performance against four other popular classical metaheuristic optimization algorithms. For comparison purposes, these problems are first tackled with 1000 variables. They are then scaled up to 20,000 variables, a very large number inspired by the human genome. The results obtained show that GTA clearly outperforms the other algorithms. To strengthen GTA's value, various sensitivity analyses are performed to verify the minimal influence of the initial parameters on efficiency. It is demonstrated that GTA fulfils the fundamental requirements of an optimization algorithm, such as ease of implementation, speed of convergence, and reliability. Since optimization permeates modeling and simulation, we finally propose that GTA will be appealing to the agent-based community and of great help for a wide variety of agent-based applications.
8. Mordstein C, Savisaar R, Young RS, Bazile J, Talmane L, Luft J, Liss M, Taylor MS, Hurst LD, Kudla G. Codon Usage and Splicing Jointly Influence mRNA Localization. Cell Syst 2020; 10:351-362.e8. [PMID: 32275854 PMCID: PMC7181179 DOI: 10.1016/j.cels.2020.03.001] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 02/04/2019] [Revised: 12/19/2019] [Accepted: 03/05/2020] [Indexed: 12/11/2022]
Abstract
In the human genome, most genes undergo splicing, and patterns of codon usage are splicing dependent: guanine and cytosine (GC) content is the highest within single-exon genes and within first exons of multi-exon genes. However, the effects of codon usage on gene expression are typically characterized in unspliced model genes. Here, we measured the effects of splicing on expression in a panel of synonymous reporter genes that varied in nucleotide composition. We found that high GC content increased protein yield, mRNA yield, cytoplasmic mRNA localization, and translation of unspliced reporters. Splicing did not affect the expression of GC-rich variants. However, splicing promoted the expression of AT-rich variants by increasing their steady-state protein and mRNA levels, in part through promoting cytoplasmic localization of mRNA. We propose that splicing promotes the nuclear export of AU-rich mRNAs and that codon- and splicing-dependent effects on expression are under evolutionary pressure in the human genome.
Affiliation(s)
- Christine Mordstein
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK; Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, UK
- Rosina Savisaar
  - Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, UK; Instituto de Medicina Molecular, João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Lisboa, Portugal
- Robert S Young
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK; Centre for Global Health Research, Usher Institute, The University of Edinburgh, Edinburgh, UK
- Jeanne Bazile
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK
- Lana Talmane
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK
- Juliet Luft
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK
- Michael Liss
  - Thermo Fisher Scientific, GENEART GmbH, Regensburg, Germany
- Martin S Taylor
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK
- Laurence D Hurst
  - Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, UK
- Grzegorz Kudla
  - MRC Human Genetics Unit, Institute for Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, UK
9. Renaud G, Hanghøj K, Korneliussen TS, Willerslev E, Orlando L. Joint Estimates of Heterozygosity and Runs of Homozygosity for Modern and Ancient Samples. Genetics 2019; 212:587-614. [PMID: 31088861 PMCID: PMC6614887 DOI: 10.1534/genetics.119.302057] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 02/26/2019] [Accepted: 05/01/2019] [Indexed: 11/18/2022]
Abstract
Both the total amount and the distribution of heterozygous sites within individual genomes are informative about the genetic diversity of the population they belong to. Detecting true heterozygous sites in ancient genomes is complicated by the generally limited coverage achieved and the presence of post-mortem damage inflating sequencing errors. Additionally, large runs of homozygosity found in the genomes of particularly inbred individuals and of domestic animals can skew estimates of genome-wide heterozygosity rates. Current computational tools aimed at estimating runs of homozygosity and genome-wide heterozygosity levels are generally sensitive to such limitations. Here, we introduce ROHan, a probabilistic method which substantially improves the estimate of heterozygosity rates both genome-wide and for genomic local windows. It combines a local Bayesian model and a Hidden Markov Model at the genome-wide level and can work both on modern and ancient samples. We show that our algorithm outperforms currently available methods for predicting heterozygosity rates for ancient samples. Specifically, ROHan can delineate large runs of homozygosity (at megabase scales) and produce a reliable confidence interval for the genome-wide rate of heterozygosity outside of such regions from modern genomes with a depth of coverage as low as 5-6× and down to 7-8× for ancient samples showing moderate DNA damage. We apply ROHan to a series of modern and ancient genomes previously published and revise available estimates of heterozygosity for humans, chimpanzees and horses.
Affiliation(s)
- Gabriel Renaud
  - Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
- Kristian Hanghøj
  - Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
  - Laboratoire d'Anthropobiologie Moléculaire et d'Imagerie de Synthèse, CNRS UMR 5288, Université de Toulouse, Université Paul Sabatier, 31000, France
- Eske Willerslev
  - Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
  - Department of Zoology, University of Cambridge, CB2 3EJ, UK
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
  - The Danish Institute for Advanced Study at The University of Southern Denmark, DK-5230 Odense M, Denmark
- Ludovic Orlando
  - Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
  - Laboratoire d'Anthropobiologie Moléculaire et d'Imagerie de Synthèse, CNRS UMR 5288, Université de Toulouse, Université Paul Sabatier, 31000, France
10. Li W, Thanos D, Provata A. Quantifying local randomness in human DNA and RNA sequences using Erdös motifs. J Theor Biol 2018; 461:41-50. [PMID: 30336158 DOI: 10.1016/j.jtbi.2018.09.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2018] [Revised: 08/14/2018] [Accepted: 09/25/2018] [Indexed: 10/28/2022]
Abstract
In 1932, Paul Erdös asked whether a random walk constructed from a binary sequence can achieve the lowest possible deviation (lowest discrepancy), for the sequence itself and for all its subsequences formed by homogeneous arithmetic progressions. Although maintaining low discrepancy is impossible for infinite sequences, as recently proven by Terence Tao, attempts were made to construct such sequences with finite lengths. We recognize that such constructed sequences (we call these "Erdös sequences") exhibit certain hallmarks of randomness at the local level: they show roughly equal frequencies of short subsequences, and at the same time exclude trivial periodic patterns. For human DNA, we examine the frequency of a set of Erdös motifs of length 10 using three nucleotide-to-binary mappings. The particular length-10 Erdös sequence is derived from the length-11 Mathias sequence and is identical to the first 10 digits of the Thue-Morse sequence, underscoring the fact that both are deficient in periodicities. Our calculations indicate that: (1) the purine (A and G)/pyrimidine (C and T) based Erdös motifs are greatly underrepresented in the human genome, (2) the strong (G and C)/weak (A and T) based Erdös motifs are slightly overrepresented, (3) the densities of the two are negatively correlated, (4) the Erdös motifs based on all three mappings combined are slightly underrepresented, and (5) the strong/weak based Erdös motifs are greatly overrepresented in human messenger RNA sequences.
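The motif-counting idea in this abstract can be sketched in a few lines. The code below generates the first 10 digits of the Thue-Morse sequence, maps a nucleotide string to binary, and counts overlapping occurrences of the motif. It is an illustration under stated assumptions: the helper names, the choice of which class maps to 1, the toy DNA string, and the use of a single motif (rather than the full motif set the authors examine) are all hypothetical.

```python
def thue_morse(n):
    """First n Thue-Morse digits: parity of the number of 1 bits in each index."""
    return [bin(i).count("1") % 2 for i in range(n)]

def to_binary(seq, ones):
    """Map nucleotides in `ones` to '1' and all other characters to '0'."""
    return "".join("1" if base in ones else "0" for base in seq.upper())

def count_overlapping(text, motif):
    """Count occurrences of motif in text, allowing overlaps."""
    return sum(text.startswith(motif, i) for i in range(len(text) - len(motif) + 1))

motif = "".join(str(d) for d in thue_morse(10))

# Assumed mapping for illustration: purines (A, G) -> 1, pyrimidines (C, T) -> 0.
toy_dna = "TAGCATCAAGTTAGCATCAAG"
binary = to_binary(toy_dna, "AG")
print(motif, count_overlapping(binary, motif))
```

Swapping the `ones` argument to `"GC"` gives the strong/weak mapping; comparing observed counts with those expected from base composition is what turns raw counts into statements about over- or underrepresentation.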
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Dimitrios Thanos
  - Department of Mathematics, National and Kapodistrian University of Athens, Athens GR-15784, Greece; Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", Athens GR-15341, Greece
- Astero Provata
  - Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", Athens GR-15341, Greece
11. Li W, Espinal-Enríquez J, Simpfendorfer KR, Hernández-Lemus E. A survey of disease connections for CD4+ T cell master genes and their directly linked genes. Comput Biol Chem 2015; 59 Pt B:78-90. [PMID: 26411796 DOI: 10.1016/j.compbiolchem.2015.08.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/09/2015] [Revised: 08/18/2015] [Accepted: 08/21/2015] [Indexed: 02/07/2023]
Abstract
Genome-wide association studies and other genetic analyses have identified a large number of genes and variants implicating a variety of disease etiological mechanisms. It is imperative for the study of human diseases to put these genetic findings into a coherent functional context. Here we use systems biology tools to examine disease connections of five master genes for CD4+ T cell subtypes (TBX21, GATA3, RORC, BCL6, and FOXP3). We compiled a list of genes functionally interacting (through protein-protein interaction, or by acting in the same pathway) with the master genes, then surveyed their disease connections, either by experimental evidence or by genetic association. Embryonic lethal genes (also known as essential genes) are over-represented in master genes and their interacting genes (55% versus 40% in other genes). Transcription factors are significantly enriched among genes interacting with the master genes (63% versus 10% in other genes). Predicted haploinsufficiency is a feature of most of these genes. Disease-connected genes are enriched in this list of genes: 42% of these genes have a disease connection according to Online Mendelian Inheritance in Man (OMIM) (versus 23% in other genes), and 74% are associated with some disease or phenotype in a Genome Wide Association Study (GWAS) (versus 43% in other genes). Not all of the diseases connected to the genes surveyed appear to be immune related, which may indicate pleiotropic functions of the master regulator genes and associated genes.
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
- Jesús Espinal-Enríquez
  - Computational Genomics Department, National Institute of Genomic Medicine, México, D.F., Mexico; Complexity in Systems Biology, Center for Complexity Sciences, Universidad Nacional Autónoma de México, México, D.F., Mexico
- Kim R Simpfendorfer
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
- Enrique Hernández-Lemus
  - Computational Genomics Department, National Institute of Genomic Medicine, México, D.F., Mexico; Complexity in Systems Biology, Center for Complexity Sciences, Universidad Nacional Autónoma de México, México, D.F., Mexico
12. Li W, Freudenberg J, Oswald M. Principles for the organization of gene-sets. Comput Biol Chem 2015; 59 Pt B:139-49. [PMID: 26188561 DOI: 10.1016/j.compbiolchem.2015.04.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 02/24/2015] [Accepted: 04/08/2015] [Indexed: 12/23/2022]
Abstract
A gene-set, an important concept in microarray expression analysis and systems biology, is a collection of genes and/or their products (i.e. proteins) that have some features in common. There are many different ways to construct gene-sets, but a systematic organization of these ways is lacking. Gene-sets are mainly organized ad hoc in current public-domain databases, with group header names often determined by practical reasons (such as the types of technology in obtaining the gene-sets or a balanced number of gene-sets under a header). Here we aim at providing a gene-set organization principle according to the level at which genes are connected: homology, physical map proximity, chemical interaction, biological, and phenotypic-medical levels. We also distinguish two types of connections between genes: actual connection versus sharing of a label. Actual connections denote direct biological interactions, whereas shared label connection denotes shared membership in a group. Some extensions of the framework are also addressed such as overlapping of gene-sets, modules, and the incorporation of other non-protein-coding entities such as microRNAs.
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
- Jan Freudenberg
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
- Michaela Oswald
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
13. Katsumi M, Ishikawa H, Tanaka Y, Saito K, Kobori Y, Okada H, Saito H, Nakabayashi K, Matsubara Y, Ogata T, Fukami M, Miyado M. Microhomology-mediated microduplication in the Y chromosomal azoospermia factor a region in a male with mild asthenozoospermia. Cytogenet Genome Res 2015; 144:285-9. [PMID: 25765000 DOI: 10.1159/000377649] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Accepted: 01/26/2015] [Indexed: 11/19/2022]
Abstract
Y chromosomal azoospermia factor (AZF) regions AZFa, AZFb and AZFc represent hotspots for copy number variations (CNVs) in the human genome; yet the number of reports of AZFa-linked duplications remains limited. Nonallelic homologous recombination has been proposed as the underlying mechanism of CNVs in AZF regions. In this study, we identified a hitherto unreported microduplication in the AZFa region in a Japanese male individual. The 629,812-bp duplication contained 22 of 46 exons of USP9Y, encoding the putative fine tuner of spermatogenesis, together with all exons of 3 other genes/pseudogenes. The breakpoints of the duplication resided in the DNA/TcMar-Tigger repeat and nonrepeat sequences, respectively, and were associated with a 2-bp microhomology, but not with short nucleotide stretches. The breakpoint-flanking regions were not enriched with GC content, palindromes, or noncanonical DNA structures. Semen analysis of the individual revealed a normal sperm concentration and mildly reduced sperm motility. The paternal DNA sample of the individual was not available for genetic analysis. The results indicate that CNVs in AZF regions can be generated by microhomology-mediated break-induced replication in the absence of known rearrangement-inducing DNA features. AZFa-linked microduplications likely permit production of a normal amount of sperm, although the precise clinical consequences of these CNVs await further investigation.
Affiliation(s)
- Momori Katsumi
  - Department of Molecular Endocrinology, National Research Institute for Child Health and Development, Tokyo, Japan
14. Balakirev ES, Chechetkin VR, Lobzin VV, Ayala FJ. Computational methods of identification of pseudogenes based on functionality: entropy and GC content. Methods Mol Biol 2014; 1167:41-62. [PMID: 24823770 DOI: 10.1007/978-1-4939-0835-6_4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/03/2023]
Abstract
Spectral entropy and GC content analyses reveal comprehensive structural features of DNA sequences. To illustrate the significance of these features, we analyze the β-esterase gene cluster, including the Est-6 gene and the ψEst-6 putative pseudogene, in seven species of the Drosophila melanogaster subgroup. The spectral entropies show distinctly lower structural ordering for ψEst-6 than for Est-6 in all species studied. However, entropy accumulation is not a completely random process for either gene, and it appears to be nucleotide dependent. Furthermore, GC content in synonymous positions is uniformly higher in Est-6 than in ψEst-6, in agreement with the reduced GC content generally observed in pseudogenes and nonfunctional sequences. The observed differences in entropy and GC content reflect an evolutionary shift associated with the process of pseudogenization and subsequent functional divergence of ψEst-6 and Est-6 after the duplication event. The data obtained show the relevance and significance of entropy and GC content analyses for pseudogene identification and for the comparative study of gene-pseudogene evolution.
Affiliation(s)
- Evgeniy S Balakirev
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
15. Cell cycle regulation of purine synthesis by phosphoribosyl pyrophosphate and inorganic phosphate. Biochem J 2013; 454:91-9. [PMID: 23734909 DOI: 10.1042/bj20130153] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 11/17/2022]
Abstract
Cells must increase synthesis of purine nucleotides/deoxynucleotides before or during S-phase. We found that rates of purine synthesis via the de novo and salvage pathways increased 5.0- and 3.3-fold respectively, as cells progressed from mid-G1-phase to early S-phase. The increased purine synthesis could be attributed to a 3.2-fold increase in intracellular PRPP (5-phosphoribosyl-α-1-pyrophosphate), a rate-limiting substrate for de novo and salvage purine synthesis. PRPP can be produced by the oxidative and non-oxidative pentose phosphate pathways, and we found a 3.1-fold increase in flow through the non-oxidative pathway, with no change in oxidative pathway activity. Non-oxidative pentose phosphate pathway enzymes showed no change in activity, but PRPP synthetase is regulated by phosphate, and we found that phosphate uptake and total intracellular phosphate concentration increased significantly between mid-G1-phase and early S-phase. Over the same time period, PRPP synthetase activity increased 2.5-fold when assayed in the absence of added phosphate, making enzyme activity dependent on cellular phosphate at the time of extraction. We conclude that purine synthesis increases as cells progress from G1- to S-phase, and that the increase is from heightened PRPP synthetase activity due to increased intracellular phosphate.
16. IsoPlotter(+): A Tool for Studying the Compositional Architecture of Genomes. ISRN Bioinformatics 2013; 2013:725434. [PMID: 25937951 PMCID: PMC4393066 DOI: 10.1155/2013/725434] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 03/05/2013] [Accepted: 04/01/2013] [Indexed: 11/18/2022]
Abstract
Eukaryotic genomes, particularly animal genomes, have a complex, nonuniform, and nonrandom internal compositional organization. The compositional organization of animal genomes can be described as a mosaic of discrete genomic regions, called "compositional domains," each with a distinct GC content that significantly differs from those of its upstream and downstream neighboring domains. A typical animal genome consists of a mixture of compositionally homogeneous and nonhomogeneous domains of varying lengths and nucleotide compositions that are interspersed with one another. We have devised IsoPlotter, an unbiased segmentation algorithm for inferring the compositional organization of genomes. IsoPlotter has become an indispensable tool for describing genomic composition and has been used in the analysis of more than a dozen genomes. Applications include describing new genomes, correlating domain composition with gene composition and density, studying the evolution of genomes, testing phylogenomic hypotheses, and detecting regions of potential interbreeding between humans and extinct hominines. To extend the use of IsoPlotter, we designed a completely automated pipeline, called IsoPlotter+, to carry out all segmentation analyses, including graphical display, and built a repository for compositional domain maps of all fully sequenced vertebrate and invertebrate genomes. The IsoPlotter+ pipeline and repository offer a comprehensive solution to the study of genome compositional architecture. Here, we demonstrate IsoPlotter+ by applying it to human and insect genomes. The computational tools and data repository are available online.
17. Li W, Sosa D, Jose MV. Human repetitive sequence densities are mostly negatively correlated with R/Y-based nucleosome-positioning motifs and positively correlated with W/S-based motifs. Genomics 2013; 101:125-33. [DOI: 10.1016/j.ygeno.2012.10.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/09/2012] [Revised: 10/28/2012] [Accepted: 10/29/2012] [Indexed: 01/25/2023]