1
Thorvaldsen S, Hössjer O. Use of directed quasi-metric distances for quantifying the information of gene families. Biosystems 2024; 243:105256. [PMID: 38871243 DOI: 10.1016/j.biosystems.2024.105256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 06/06/2024] [Accepted: 06/11/2024] [Indexed: 06/15/2024]
Abstract
A major hindrance to analyzing information in genetic or protein sequence data has been the lack of a mathematical framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information that is infused in order to generate a sample of such data is quantified as a distance on X between the prior distribution of data and the empirical distribution of the sample. A number of distances on X are treated. All of them have an information theoretic interpretation, reflecting the information that the sampling mechanism provides about which variants have a selective advantage and therefore appear more frequently than expected under the prior. This includes distances on X based on mutual information, conditional mutual information, active information, and functional information. The functional information distance is singled out as particularly useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, where functional entities are retained, whereas non-functional categories are censored, and 2) evolutionary waiting times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric, mountainous landscape. This quasi-metric property is also retained for a robustified version of the functional information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric has been applied with success to bioinformatics data sets, for proteins and sequence alignments of protein families.
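The rejection sampling reading of functional information can be illustrated with a small sketch. The abstract does not give the paper's exact definition on X, so the code below assumes the common form in which functional information is minus the base-2 logarithm of the prior probability of the functional categories. The function names, the eight-category prior, and the choice of functional set are all hypothetical and are not taken from the paper.

```python
import math
import random


def functional_information(prior, functional):
    """Bits of functional information, assuming FI = -log2 P_prior(functional set).

    `prior` maps each category to its prior probability; `functional` is the
    set of categories that are retained. This definition is an assumption made
    for illustration, not the paper's stated formula.
    """
    p_functional = sum(p for cat, p in prior.items() if cat in functional)
    if p_functional <= 0:
        raise ValueError("functional set has zero prior probability")
    return -math.log2(p_functional)


def rejection_sample(prior, functional, rng):
    """Draw from the prior until a functional category appears.

    Returns the retained category and the number of draws it took.
    Non-functional draws are discarded (censored).
    """
    categories = list(prior)
    weights = [prior[c] for c in categories]
    draws = 0
    while True:
        draws += 1
        candidate = rng.choices(categories, weights=weights, k=1)[0]
        if candidate in functional:
            return candidate, draws


# Hypothetical example: 8 equally likely variants, one of which is functional.
prior = {f"v{i}": 1 / 8 for i in range(8)}
functional = {"v3"}

bits = functional_information(prior, functional)
retained, draws = rejection_sample(prior, functional, random.Random(0))
print(f"functional information: {bits:.2f} bits")
print(f"retained {retained} after {draws} draws")
```

Under this assumed definition, the number of draws is geometrically distributed with mean 1 / P(functional) = 2^FI, which is one simple way to connect the information measure to a waiting time.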
Affiliation(s)
- Steinar Thorvaldsen
- Dept. of Education, Division of Science, UiT the Arctic University of Norway, Norway.
- Ola Hössjer
- Dept. of Mathematics, Stockholm University, Sweden.

2
Bajić D. Information Theory, Living Systems, and Communication Engineering. Entropy (Basel) 2024; 26:430. [PMID: 38785679 PMCID: PMC11120474 DOI: 10.3390/e26050430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 05/08/2024] [Accepted: 05/17/2024] [Indexed: 05/25/2024]
Abstract
Mainstream research on information theory within the field of living systems involves the application of analytical tools to understand a broad range of life processes. This paper is dedicated to an opposite problem: it explores the information theory and communication engineering methods that have counterparts in the data transmission process by way of DNA structures and neural fibers. Considering the requirements of modern multimedia, transmission methods chosen by nature may be different, suboptimal, or even far from optimal. However, nature is known for rational resource usage, so its methods have a significant advantage: they are proven to be sustainable. Perhaps understanding the engineering aspects of methods of nature can inspire a design of alternative green, stable, and low-cost transmission.
Affiliation(s)
- Dragana Bajić
- Department of Communications and Signal Processing, Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovica 6, 21000 Novi Sad, Serbia

3
Schneider TD. Generalizing the isothermal efficiency by using Gaussian distributions. PLoS One 2023; 18:e0279758. [PMID: 36626367 PMCID: PMC9831307 DOI: 10.1371/journal.pone.0279758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Accepted: 11/28/2022] [Indexed: 01/11/2023] Open
Abstract
Unlike the Carnot heat engine efficiency published in 1824, an isothermal efficiency derived from thermodynamics and information theory can be applied to biological systems. The original approach by Pierce and Cutler in 1959 to derive the isothermal efficiency equation came from Shannon's channel capacity of 1949 and from Felker's 1952 determination of the minimum energy dissipation needed to gain a bit. In 1991 and 2010 Schneider showed how the isothermal efficiency equation can be applied to molecular machines and that this can be used to explain why several molecular machines are 70% efficient. Surprisingly, some macroscopic biological systems, such as whole ecosystems, are also 70% efficient but it is hard to see how this could be explained by a thermodynamic and molecular theory. The thesis of this paper is that the isothermal efficiency can be derived without using thermodynamics by starting from a set of independent Gaussian distributions. This novel derivation generalizes the isothermal efficiency equation for use at all levels of biology, from molecules to ecosystems.
Affiliation(s)
- Thomas D. Schneider
- National Institutes of Health, National Cancer Institute, Center for Cancer Research, RNA Biology Laboratory, Frederick, MD, United States of America

4
Sánchez IE, Galpern EA, Garibaldi MM, Ferreiro DU. Molecular Information Theory Meets Protein Folding. J Phys Chem B 2022; 126:8655-8668. [PMID: 36282961 DOI: 10.1021/acs.jpcb.2c04532] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/11/2023]
Abstract
We propose an application of molecular information theory to analyze the folding of single domain proteins. We analyze results from various areas of protein science, such as sequence-based potentials, reduced amino acid alphabets, backbone configurational entropy, secondary structure content, residue burial layers, and mutational studies of protein stability changes. We found that the average information contained in the sequences of evolved proteins is very close to the average information needed to specify a fold ∼2.2 ± 0.3 bits/(site·operation). The effective alphabet size in evolved proteins equals the effective number of conformations of a residue in the compact unfolded state at around 5. We calculated an energy-to-information conversion efficiency upon folding of around 50%, lower than the theoretical limit of 70%, but much higher than human-built macroscopic machines. We propose a simple mapping between molecular information theory and energy landscape theory and explore the connections between sequence evolution, configurational entropy, and the energetics of protein folding.
Affiliation(s)
- Ignacio E Sánchez
- Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Ezequiel A Galpern
- Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Martín M Garibaldi
- Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Diego U Ferreiro
- Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina

5
Schneider TD, Jejjala V. Restriction enzymes use a 24 dimensional coding space to recognize 6 base long DNA sequences. PLoS One 2019; 14:e0222419. [PMID: 31671158 PMCID: PMC6822723 DOI: 10.1371/journal.pone.0222419] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/12/2019] [Accepted: 08/29/2019] [Indexed: 11/19/2022] Open
Abstract
Restriction enzymes recognize and bind to specific sequences on invading bacteriophage DNA. Like a key in a lock, these proteins require many contacts to specify the correct DNA sequence. Using information theory we develop an equation that defines the number of independent contacts, which is the dimensionality of the binding. We show that EcoRI, which binds to the sequence GAATTC, functions in 24 dimensions. Information theory represents messages as spheres in high dimensional spaces. Better sphere packing leads to better communications systems. The densest known packing of hyperspheres occurs on the Leech lattice in 24 dimensions. We suggest that the single protein EcoRI molecule employs a Leech lattice in its operation. Optimizing density of sphere packing explains why 6 base restriction enzymes are so common.
Affiliation(s)
- Thomas D. Schneider
- National Institutes of Health, National Cancer Institute, Center for Cancer Research, RNA Biology Laboratory, Frederick, Maryland, United States of America
- Vishnu Jejjala
- Mandelstam Institute for Theoretical Physics, School of Physics, NITheP, and CoE-MaSS, University of the Witwatersrand, Johannesburg, South Africa
- David Rittenhouse Laboratory, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

6
Hobden RM, Tétreault S. Motor Control and the Injured and Healthy Artist. Adv Exp Med Biol 2014; 826:179-204. [DOI: 10.1007/978-1-4939-1338-1_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]

7
Abstract
In this Perspective, we propose that communication theory (a field of mathematics concerned with the problems of signal transmission, reception and processing) provides a new quantitative lens for investigating multicellular biology, ancient and modern. What underpins the cohesive organisation and collective behaviour of multicellular ecosystems such as microbial colonies and communities (microbiomes) and multicellular organisms such as plants and animals, whether built of simple tissue layers (sponges) or of complex differentiated cells arranged in tissues and organs (members of the 35 or so phyla of the subkingdom Metazoa)? How do mammalian tissues and organs develop, maintain their architecture, become subverted in disease, and decline with age? How did single-celled organisms coalesce to produce many-celled forms that evolved and diversified into the varied multicellular organisms in existence today? Some answers can be found in the blueprints or recipes encoded in (epi)genomes, yet others lie in the generic physical properties of biological matter such as the ability of cell aggregates to attain a certain complexity in size, shape, and pattern. We suggest that Lasswell's maxim "Who says what to whom in what channel with what effect" provides a foundation for understanding not only the emergence and evolution of multicellularity, but also the assembly and sculpting of multicellular ecosystems and many-celled structures, whether of natural or human-engineered origin. We explore how the abstraction of communication theory as an organising principle for multicellular biology could be realised. We highlight the inherent ability of communication theory to be blind to molecular and/or genetic mechanisms. We describe selected applications that analyse the physics of communication and use energy efficiency as a central tenet.
Whilst communication theory has contributed, and could contribute further, to understanding a myriad of problems in biology, investigations of multicellular biology could, in turn, lead to advances in communication theory, especially in the still immature field of network information theory.
Affiliation(s)
- I S Mian
- Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.

8
Abstract
The idea that we could build molecular communications systems can be advanced by investigating how actual molecules from living organisms function. Information theory provides tools for such an investigation. This review describes how we can compute the average information in the DNA binding sites of any genetic control protein and how this can be extended to analyze its individual sites. A formula equivalent to Claude Shannon's channel capacity can be applied to molecular systems and used to compute the efficiency of protein binding. This efficiency is often 70% and a brief explanation for that is given. The results imply that biological systems have evolved to function at channel capacity, which means that we should be able to build molecular communications that are just as robust as our macroscopic ones.
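As a rough illustration of the average information in a set of binding sites, the sketch below sums, over the aligned positions, the drop from the maximum uncertainty of 2 bits per DNA base to the observed uncertainty at that position. It assumes equiprobable background bases and omits any small-sample correction, both simplifications made for this example. The function names and the toy alignment are hypothetical and are not data from the review.

```python
import math
from collections import Counter


def position_information(column, alphabet_size=4):
    """Information at one aligned position, in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the observed
    symbol frequencies. Assumes an equiprobable background and applies no
    small-sample correction.
    """
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def total_information(sites):
    """Sum of per-position information over a set of aligned, equal-length sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    return sum(position_information(col) for col in zip(*sites))


# Hypothetical toy alignment of four 6-base sites (not real binding-site data).
toy_sites = ["GAATTC", "GAATTC", "GACTTC", "GAGTTC"]
print(f"total information: {total_information(toy_sites):.2f} bits")
```

A fully conserved position contributes 2 bits and a position where all four bases are equally frequent contributes 0 bits, so the total reflects how strongly the sites constrain the sequence.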
Affiliation(s)
- Thomas D. Schneider
- National Institutes of Health, National Cancer Institute at Frederick, P.O. Box B, Frederick, MD 21702-1201, United States

9
Schneider TD. 70% efficiency of bistate molecular machines explained by information theory, high dimensional geometry and evolutionary convergence. Nucleic Acids Res 2010; 38:5995-6006. [PMID: 20562221 PMCID: PMC2952855 DOI: 10.1093/nar/gkq389] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022] Open
Abstract
The relationship between information and energy is key to understanding biological systems. We can display the information in DNA sequences specifically bound by proteins by using sequence logos, and we can measure the corresponding binding energy. These can be compared by noting that one of the forms of the second law of thermodynamics defines the minimum energy dissipation required to gain one bit of information. Under the isothermal conditions in which molecular machines function, this is Emin = kB T ln 2 joules per bit (kB is Boltzmann's constant and T is the absolute temperature). Then an efficiency of binding can be computed by dividing the information in a logo by the free energy of binding after it has been converted to bits. The isothermal efficiencies of not only genetic control systems, but also visual pigments are near 70%. From information and coding theory, the theoretical efficiency limit for bistate molecular machines is ln 2 ≈ 0.6931. Evolutionary convergence to maximum efficiency is limited by the constraint that molecular states must be distinct from each other. The result indicates that natural molecular machines operate close to their information processing maximum (the channel capacity), and implies that nanotechnology can attain this goal.
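The efficiency calculation described in this abstract can be sketched numerically: convert the dissipated free energy to bits by dividing by kB T ln 2, then divide the information gained by that number. The function names, the temperature, and the information and energy values below are hypothetical choices for illustration, and the dissipated energy is assumed to be passed in as a positive quantity in joules.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def min_energy_per_bit(temperature_k):
    """Minimum energy dissipation per bit gained, kB * T * ln 2, in joules."""
    return K_B * temperature_k * math.log(2)


def isothermal_efficiency(info_bits, dissipated_joules, temperature_k):
    """Information gained divided by the dissipated energy expressed in bits.

    `dissipated_joules` is assumed to be a positive energy per binding event.
    """
    energy_bits = dissipated_joules / min_energy_per_bit(temperature_k)
    return info_bits / energy_bits


# Hypothetical numbers, chosen only to exercise the formula.
temperature = 310.0                                   # kelvin
dissipated = 20.0 * min_energy_per_bit(temperature)   # enough energy for 20 bits
information = 14.0                                    # bits read from a logo

print(f"Emin at {temperature} K: {min_energy_per_bit(temperature):.3e} J/bit")
print(f"efficiency: {isothermal_efficiency(information, dissipated, temperature):.2f}")
```

With these made-up inputs the ratio is 14/20 = 0.70, close to the ln 2 limit quoted in the abstract; measured information and binding energies would be needed to say anything about a specific molecule.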
Affiliation(s)
- Thomas D Schneider
- Center for Cancer Research Nanobiology Program, National Cancer Institute, Frederick, MD 21702-1201, USA.

10
Lyakhov IG, Krishnamachari A, Schneider TD. Discovery of novel tumor suppressor p53 response elements using information theory. Nucleic Acids Res 2008; 36:3828-33. [PMID: 18495754 PMCID: PMC2441790 DOI: 10.1093/nar/gkn189] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022] Open
Abstract
An accurate method for locating genes under tumor suppressor p53 control that is based on a well-established mathematical theory and built using naturally occurring, experimentally proven p53 sites is essential in understanding the complete p53 network. We used a molecular information theory approach to create a flexible model for p53 binding. By searching around transcription start sites in human chromosomes 1 and 2, we predicted 16 novel p53 binding sites and experimentally demonstrated that 15 of the 16 (94%) sites were bound by p53. Some were also bound by the related proteins p63 and p73. Thirteen of the adjacent genes were controlled by at least one of the proteins. Eleven of the 16 sites (69%) had not been identified previously. This molecular information theory approach can be extended to any genetic system to predict new sites for DNA-binding proteins.
Affiliation(s)
- Ilya G Lyakhov
- Basic Research Program, SAIC-Frederick, Inc., NCI at Frederick, Frederick, MD, USA