1
|
Houdré C, Işlak Ü. A central limit theorem for the length of the longest common subsequences in random words. ELECTRON J PROBAB 2023. [DOI: 10.1214/22-ejp894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Affiliation(s)
- Christian Houdré
- School of Mathematics, Georgia Institute of Technology, Atlanta, Georgia, 30332-0160, USA
| | - Ümit Işlak
- Bogazici University, Faculty of Arts and Science, Department of Mathematics, 34342, Bebek-Istanbul, Turkey
| |
Collapse
|
2
|
Press WH. Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates. PNAS NEXUS 2022; 1:pgac252. [PMID: 36712375 PMCID: PMC9802387 DOI: 10.1093/pnasnexus/pgac252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/04/2022] [Accepted: 10/31/2022] [Indexed: 11/06/2022]
Abstract
Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲104 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳106 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 106 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
Collapse
|
3
|
Vladimirov A, Shlosman S, Nechaev S. Brownian flights over a circle. Phys Rev E 2020; 102:012124. [PMID: 32794955 DOI: 10.1103/physreve.102.012124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2020] [Accepted: 06/22/2020] [Indexed: 06/11/2023]
Abstract
The stationary radial distribution, P(ρ), of a random walk with the diffusion coefficient D, which winds at the tangential velocity V around an impenetrable disk of radius R for R≫D/V converges to the distribution involving the Airy function. Typical trajectories are localized in the circular strip [R,R+δR^{1/3}], where δ is a constant which depends on the parameters D and V and is independent of R.
Collapse
Affiliation(s)
| | - Senya Shlosman
- Institute of Information Transmission Problems RAS, 127051 Moscow, Russia
- Skolkovo Institute of Science and Technology, 143005 Skolkovo, Russia
- Aix-Marseille University, Universite of Toulon, CNRS, CPT UMR 7332, 13288 Marseille, France
| | - Sergei Nechaev
- Interdisciplinary Scientific Center Poncelet, CNRS UMI 2615, 119002 Moscow, Russia
- P. N. Lebedev Physical Institute RAS, 119991 Moscow, Russia
| |
Collapse
|
4
|
Nechaev S, Polovnikov K, Shlosman S, Valov A, Vladimirov A. Anomalous one-dimensional fluctuations of a simple two-dimensional random walk in a large-deviation regime. Phys Rev E 2019; 99:012110. [PMID: 30780340 DOI: 10.1103/physreve.99.012110] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2018] [Indexed: 11/07/2022]
Abstract
The following question is the subject of our work: could a two-dimensional (2D) random path pushed by some constraints to an improbable "large-deviation regime" possess extreme statistics with one-dimensional (1D) Kardar-Parisi-Zhang (KPZ) fluctuations? The answer is positive, though nonuniversal, since the fluctuations depend on the underlying geometry. We consider in detail two examples of 2D systems for which imposed external constraints force the underlying stationary stochastic process to stay in an atypical regime with anomalous statistics. The first example deals with the fluctuations of a stretched 2D random walk above a semicircle or a triangle. In the second example we consider a 2D biased random walk along a channel with forbidden voids of circular and triangular shapes. In both cases we are interested in the dependence of a typical span 〈d(t)〉∼t^{γ} of the trajectory of t steps above the top of the semicircle or the triangle. We show that γ=1/3, i.e., 〈d(t)〉 shares the KPZ statistics for the semicircle, while γ=0 for the triangle. We propose heuristic derivations of scaling exponents γ for different geometries, justify them by explicit analytic computations, and compare with numeric simulations. For practical purposes, our results demonstrate that the geometry of voids in a channel might have a crucial impact on the width of the boundary layer and, thus, on the heat transfer in the channel.
Collapse
Affiliation(s)
- Sergei Nechaev
- Interdisciplinary Scientific Center Poncelet, CNRS UMI 2615, 119002 Moscow, Russia.,P. N. Lebedev Physical Institute RAS, 119991 Moscow, Russia
| | - Kirill Polovnikov
- Physics Department, Lomonosov Moscow State University, 119992 Moscow, Russia.,Skolkovo Institute of Science and Technology, 143005 Skolkovo, Russia
| | - Senya Shlosman
- Skolkovo Institute of Science and Technology, 143005 Skolkovo, Russia.,Institute of Information Transmission Problems RAS, 127051 Moscow, Russia.,Aix-Marseille University, University of Toulon, CNRS, CPT UMR 7332, 13288, Marseille, France
| | - Alexander Valov
- N. N. Semenov Institute of Chemical Physics RAS, 119991 Moscow, Russia
| | | |
Collapse
|
5
|
Le Doussal P, Majumdar SN, Schehr G. Multicritical Edge Statistics for the Momenta of Fermions in Nonharmonic Traps. PHYSICAL REVIEW LETTERS 2018; 121:030603. [PMID: 30085768 DOI: 10.1103/physrevlett.121.030603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/07/2018] [Revised: 05/24/2018] [Indexed: 06/08/2023]
Abstract
We compute the joint statistics of the momenta p_{i} of N noninteracting fermions in a trap, near the Fermi edge, with a particular focus on the largest one p_{max}. For a 1D harmonic trap, momenta and positions play a symmetric role, and hence the joint statistics of momenta are identical to that of the positions. In particular, p_{max}, as x_{max}, is distributed according to the Tracy-Widom distribution. Here we show that novel "momentum edge statistics" emerge when the curvature of the potential vanishes, i.e., for "flat traps" near their minimum, with V(x)∼x^{2n} and n>1. These are based on generalizations of the Airy kernel that we obtain explicitly. The fluctuations of p_{max} are governed by new universal distributions determined from the nth member of the second Painlevé hierarchy of nonlinear differential equations, with connections to multicritical random matrix models. Finite temperature extensions and possible experimental signatures in cold atoms are discussed.
Collapse
Affiliation(s)
- Pierre Le Doussal
- CNRS-Laboratoire de Physique Théorique de l'Ecole Normale Supérieure, 24 rue Lhomond, 75231 Paris Cedex, France
| | - Satya N Majumdar
- LPTMS, CNRS, Université Paris-Sud, Université Paris-Saclay, 91405 Orsay, France
| | - Grégory Schehr
- LPTMS, CNRS, Université Paris-Sud, Université Paris-Saclay, 91405 Orsay, France
| |
Collapse
|
6
|
De Luca A, Le Doussal P. Mutually avoiding paths in random media and largest eigenvalues of random matrices. Phys Rev E 2017; 95:030103. [PMID: 28415280 DOI: 10.1103/physreve.95.030103] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2016] [Indexed: 06/07/2023]
Abstract
Recently, it was shown that the probability distribution function (PDF) of the free energy of a single continuum directed polymer (DP) in a random potential, equivalently to the height of a growing interface described by the Kardar-Parisi-Zhang (KPZ) equation, converges at large scale to the Tracy-Widom distribution. The latter describes the fluctuations of the largest eigenvalue of a random matrix, drawn from the Gaussian unitary ensemble (GUE), and the result holds for a DP with fixed end points, i.e., for the KPZ equation with droplet initial conditions. A more general conjecture can be put forward, relating the free energies of N>1 noncrossing continuum DP in a random potential, to the sum of the Nth largest eigenvalues of the GUE. Here, using replica methods, we provide an important test of this conjecture by calculating exactly the right tails of both PDFs and showing that they coincide for arbitrary N.
Collapse
Affiliation(s)
- Andrea De Luca
- Laboratoire de Physique Théorique et Modèles Statistiques (UMR CNRS 8626), Université Paris-Sud, Orsay, France
| | - Pierre Le Doussal
- Laboratoire de Physique Théorique de l'ENS, CNRS & Ecole Normale Supérieure de Paris, Paris, France
| |
Collapse
|
7
|
Chu S, Kardar M. Probability distributions for directed polymers in random media with correlated noise. Phys Rev E 2016; 94:010101. [PMID: 27575059 DOI: 10.1103/physreve.94.010101] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2016] [Indexed: 06/06/2023]
Abstract
The probability distribution for the free energy of directed polymers in random media (DPRM) with uncorrelated noise in d=1+1 dimensions satisfies the Tracy-Widom distribution. We inquire if and how this universal distribution is modified in the presence of spatially correlated noise. The width of the distribution scales as the DPRM length to an exponent β, in good (but not full) agreement with previous renormalization group and numerical results. The scaled probability is well described by the Tracy-Widom form for uncorrelated noise, but becomes symmetric with increasing correlation exponent. We thus find a class of distributions that continuously interpolates between Tracy-Widom and Gaussian forms.
Collapse
Affiliation(s)
- Sherry Chu
- Department of Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Mehran Kardar
- Department of Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| |
Collapse
|
8
|
Valba OV, Tamm MV, Nechaev SK. New alphabet-dependent morphological transition in random RNA alignment. PHYSICAL REVIEW LETTERS 2012; 109:018102. [PMID: 23031133 DOI: 10.1103/physrevlett.109.018102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/24/2011] [Indexed: 06/01/2023]
Abstract
We study the fraction f of nucleotides involved in the formation of a cactuslike secondary structure of random heteropolymer RNA-like molecules. In the low-temperature limit, we study this fraction as a function of the number c of different nucleotide species. We show, that with changing c, the secondary structures of random RNAs undergo a morphological transition: f(c)→1 for c≤c(cr) as the chain length n goes to infinity, signaling the formation of a virtually perfect gapless secondary structure; while f(c)<1 for c>c(cr), which means that a nonperfect structure with gaps is formed. The strict upper and lower bounds 2≤c(cr)≤4 are proven, and the numerical evidence for c(cr) is presented. The relevance of the transition from the evolutional point of view is discussed.
Collapse
Affiliation(s)
- O V Valba
- Moscow Institute of Physics and Technology, 141700, Dolgoprudny, Russia
| | | | | |
Collapse
|
9
|
Val’ba OV, Nechaev SK, Tamm MV. Interaction of RNA molecules: Binding energy and the statistical properties of random sequences. RUSSIAN JOURNAL OF PHYSICAL CHEMISTRY B 2012. [DOI: 10.1134/s1990793112060085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
10
|
Calabrese P, Le Doussal P. Exact solution for the Kardar-Parisi-Zhang equation with flat initial conditions. PHYSICAL REVIEW LETTERS 2011; 106:250603. [PMID: 21770622 DOI: 10.1103/physrevlett.106.250603] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/13/2011] [Indexed: 05/31/2023]
Abstract
We provide the first exact calculation of the height distribution at arbitrary time t of the continuum Kardar-Parisi-Zhang (KPZ) growth equation in one dimension with flat initial conditions. We use the mapping onto a directed polymer with one end fixed, one free, and the Bethe ansatz for the replicated attractive boson model. We obtain the generating function of the moments of the directed polymer partition sum as a Fredholm Pfaffian. Our formula, valid for all times, exhibits convergence of the free energy (i.e., KPZ height) distribution to the Gaussian orthogonal ensemble Tracy-Widom distribution at large time.
Collapse
Affiliation(s)
- Pasquale Calabrese
- Dipartimento di Fisica dell'Università di Pisa and INFN, 56127 Pisa, Italy
| | | |
Collapse
|
11
|
Wolfsheimer S, Herms I, Rahmann S, Hartmann AK. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics 2011; 12:47. [PMID: 21291566 PMCID: PMC3042914 DOI: 10.1186/1471-2105-12-47] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2010] [Accepted: 02/03/2011] [Indexed: 12/03/2022] Open
Abstract
Background Molecular database search tools need statistical models to assess the significance for the resulting hits. In the classical approach one asks the question how probable a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort had been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or use of Hidden Markov Models. Additional, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet. Results In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position based scoring scheme against the classical approach. Conclusions The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
Collapse
Affiliation(s)
- Stefan Wolfsheimer
- Laboratoire MAP5 (UMR CNRS 8145), Université Paris Descartes, Paris, France.
| | | | | | | |
Collapse
|
12
|
Potestio R, Caccioli F, Vivo P. Random matrix approach to collective behavior and bulk universality in protein dynamics. PHYSICAL REVIEW LETTERS 2009; 103:268101. [PMID: 20366347 DOI: 10.1103/physrevlett.103.268101] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/31/2009] [Indexed: 05/29/2023]
Abstract
Covariance matrices of amino acid displacements, commonly used to characterize the large-scale movements of proteins, are investigated through the prism of random matrix theory. Bulk universality is detected in the local spacing statistics of noise-dressed eigenmodes, which is well described by a Brody distribution with parameter beta approximately = 0.8. This finding, supported by other consistent indicators, implies a novel quantitative criterion to single out the collective degrees of freedom of the protein from the majority of high-energy, localized vibrations.
Collapse
Affiliation(s)
- Raffaello Potestio
- Scuola Internazionale Superiore di Studi Avanzati, Via Beirut 2/4-34151 Trieste, Italy
| | | | | |
Collapse
|
13
|
Tamm MV, Nechaev SK. Unzipping of two random heteropolymers: ground-state energy and finite-size effects. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2008; 78:011903. [PMID: 18763978 DOI: 10.1103/physreve.78.011903] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2007] [Indexed: 05/26/2023]
Abstract
We consider a pair of random heteropolymer chains with quenched primary sequences. For this system we have analyzed the dependence of average ground state energy per monomer E on chain length n in the ensemble of chains with uniform distribution of primary sequences of monomers. Every monomer of the first (second) chain is randomly and independently chosen with the uniform probability distribution p=1/c from a set of c different types A , B , C , D ,... (A', B', C', D',...) . Monomers of the first chain could form saturating reversible bonds with monomers of the second chain. The bonds between similar monomer types (such as A-A', B-B', C-C', etc.) have the attraction energy u , while the bonds between different monomer types (such as A-B', A-D', B-D', etc.) have the attraction energy v . The main attention is paid to the computation of the normalized free energy E(n) for intermediate chain lengths n and different ratios a=v/u at sufficiently low temperatures, when the entropic contribution of the loop formation is negligible compared to direct energetic interactions between chain monomers, and when the partition function of the chains is dominated by the ground state. The performed analysis allows one to derive the force f(x) which is necessary to apply for unzipping of two random heteropolymers of equal lengths whose ends are separated by the distance x , averaged over all equally distributed primary structures at low temperatures for fixed values a and c .
Collapse
Affiliation(s)
- M V Tamm
- Physics Department, Moscow State University, Moscow, Russia
| | | |
Collapse
|
14
|
Dean DS, Majumdar SN. Extreme value statistics of eigenvalues of Gaussian random matrices. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2008; 77:041108. [PMID: 18517579 DOI: 10.1103/physreve.77.041108] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/10/2008] [Indexed: 05/26/2023]
Abstract
We compute exact asymptotic results for the probability of the occurrence of large deviations of the largest (smallest) eigenvalue of random matrices belonging to the Gaussian orthogonal, unitary, and symplectic ensembles. In particular, we show that the probability that all the eigenvalues of an (NxN) random matrix are positive (negative) decreases for large N as approximately exp [-beta theta(0)N2] where the Dyson index beta characterizes the ensemble and the exponent theta(0)=(ln 3)/4=0.274653... is universal. We compute the probability that the eigenvalues lie in the interval [zeta1,zeta2] which allows us to calculate the joint probability distribution of the minimum and the maximum eigenvalue. As a by-product, we also obtain exactly the average density of states in Gaussian ensembles whose eigenvalues are restricted to lie in the interval [zeta1,zeta2] , thus generalizing the celebrated Wigner semi-circle law to these restricted ensembles. It is found that the density of states generically exhibits an inverse square-root singularity at the location of the barriers. These results are confirmed by numerical simulations. Some of the results presented in detail here were announced in a previous paper [D. S. Dean and S. N. Majumdar, Phys. Rev. Lett. 97, 160201 (2006)].
Collapse
Affiliation(s)
- David S Dean
- Laboratoire de Physique Théorique UMR 5152 du CNRS, Université Paul Sabatier, 118, route de Narbonne, 31062 Toulouse Cedex 4, France
| | | |
Collapse
|
15
|
Majumdar SN, Mallick K, Nechaev S. Bethe Ansatz in the Bernoulli matching model of random sequence alignment. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2008; 77:011110. [PMID: 18351821 DOI: 10.1103/physreve.77.011110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/03/2007] [Indexed: 05/26/2023]
Abstract
For the Bernoulli matching model of the sequence alignment problem we apply the Bethe Ansatz technique via an exact mapping to the five-vertex model on a square lattice. Considering the terracelike representation of the sequence alignment problem, we reproduce by the Bethe Ansatz the results for the averaged length of the longest common subsequence in the Bernoulli approximation. In addition, we compute the average number of nucleation centers of the terraces.
Collapse
Affiliation(s)
- Satya N Majumdar
- Laboratoire de Physique Théorique et Modèles Statistiques, Université de Paris-Sud, CNRS UMR 8626, 91405 Orsay Cedex, France
| | | | | |
Collapse
|
16
|
Dean DS, Majumdar SN. Large deviations of extreme eigenvalues of random matrices. PHYSICAL REVIEW LETTERS 2006; 97:160201. [PMID: 17155374 DOI: 10.1103/physrevlett.97.160201] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/03/2006] [Indexed: 05/12/2023]
Abstract
We calculate analytically the probability of large deviations from its mean of the largest (smallest) eigenvalue of random matrices belonging to the Gaussian orthogonal, unitary, and symplectic ensembles. In particular, we show that the probability that all the eigenvalues of an (N x N) random matrix are positive (negative) decreases for large N as approximately exp[-betatheta(0)N2] where the parameter beta characterizes the ensemble and the exponent theta(0)=(ln3)/4=0.274 653... is universal. We also calculate exactly the average density of states in matrices whose eigenvalues are restricted to be larger than a fixed number zeta, thus generalizing the celebrated Wigner semicircle law. The density of states generically exhibits an inverse square-root singularity at zeta.
Collapse
Affiliation(s)
- David S Dean
- Laboratoire de Physique Théorique (UMR 5152 du CNRS), Université Paul Sabatier, 118, Route de Narbonne, Toulouse Cedex 4, France
| | | |
Collapse
|
17
|
Sardiu ME, Alves G, Yu YK. Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2005; 72:061917. [PMID: 16485984 DOI: 10.1103/physreve.72.061917] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/03/2005] [Revised: 10/21/2005] [Indexed: 05/06/2023]
Abstract
Sequence alignment is one of the most important bioinformatics tools for modern molecular biology. The statistical characterization of gapped alignment scores has been a long-standing problem in sequence alignment research. Using a variant of the directed path in random media model, we investigate the score statistics of global sequence alignment taking into account, in particular, the compositional bias of the sequences compared. Such statistics are used to distinguish accidental similarity due to compositional similarity from biologically significant similarity. To accommodate the compositional bias, we introduce an extra parameter p indicating the probability for positive matching scores to occur. When p is small, a high scoring alignment obviously cannot come from compositional similarity. When p is large, the highest scoring point within a global alignment tends to be close to the end of both sequences, in which case we say the system percolates. By applying finite-size scaling theory on percolating probability functions of various sizes (sequence lengths), the critical p at infinite size is obtained. For alignment of length t, the fact that the score fluctuation grows as chi(t)1/3 is confirmed upon investigating the scaling form of the alignment score. Using the Kolmogorov-Smirnov statistics test, we show that the random variable , if properly scaled, follows the Tracy-Widom distributions: Gaussian orthogonal ensemble for p slightly larger than pc and Gaussian unitary ensemble for larger p. Although these results deepen our understanding of the distribution of alignment scores, the use of these results in practical applications remains somewhat heuristic and needs to be further developed. Nevertheless, the possibility of characterizing score statistics for modest system size (sequence lengths), via proper reparametrization of alignment scores, is illustrated.
Collapse
Affiliation(s)
- Mihaela E Sardiu
- Department of Physics, Florida Atlantic University, Boca Raton, Florida 33431, USA
| | | | | |
Collapse
|