1. Zipf's law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort. Psychon Bull Rev 2023; 30:77-101. PMID: 35840837; PMCID: PMC9971120; DOI: 10.3758/s13423-022-02142-9.
Abstract
The ubiquitous inverse relationship between word frequency and word rank is commonly known as Zipf's law. Its theoretical underpinning, the so-called principle of least effort, states that this inverse relationship reduces effort for both the speaker and the hearer. Most research has focused on demonstrating the inverse relationship only for written monolog, only for the frequencies and ranks of a single linguistic unit (generally word unigrams), with strong correlations between the power law and the observed frequency distributions, and with little or no attention to psychological mechanisms such as the principle of least effort. The current paper extends the existing findings in four ways: by focusing not on written monolog but on a more fundamental form of communication, spoken dialog; by investigating not only word unigrams but also units quantified at the syntactic, pragmatic, utterance, and nonverbal communicative levels; by showing that the adequacy of Zipf's formula seems ubiquitous whereas the exponent of the power-law curve is not; and by placing these findings in the context of Zipf's principle of least effort through a redefinition of effort in terms of the cognitive resources available for communication. Our findings show that Zipf's law also applies to a more natural form of communication, spoken dialog; that it applies to a range of linguistic units beyond word unigrams; that the generally good fit of Zipf's law needs to be revisited in light of the parameters of the formula; and that the principle of least effort is a useful theoretical framework for findings on Zipf's law.
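The rank-frequency relationship described above can be checked on any token sequence. A minimal sketch (illustrative only, not the paper's fitting procedure): count word frequencies, sort them into ranks, and estimate the exponent s in f(r) ∝ r^(-s) by least squares on the log-log pairs.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate s in f(r) ~ r^(-s) by least squares on log-rank vs. log-frequency."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # positive for Zipf-like data

# Toy corpus with frequencies 8, 4, 2, 1 over ranks 1..4
corpus = ["the"] * 8 + ["of"] * 4 + ["and"] * 2 + ["said"]
s = zipf_exponent(corpus)
```

On real corpora a maximum-likelihood fit over the distribution tail is preferable to this simple regression, but the sketch conveys the quantity being estimated.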
2.
Abstract
In his pioneering research, G. K. Zipf formulated a couple of statistical laws on the relationship between the frequency of a word and its number of meanings: the law of meaning distribution, relating the number of meanings of a word and its frequency rank, and the meaning-frequency law, relating the frequency of a word with its number of meanings. Although these laws were formulated more than half a century ago, they have been investigated in only a few languages. Here we present the first study of these laws in Catalan. We verify these laws in Catalan via the relationship among their exponents and that of the rank-frequency law. We present a new protocol for the analysis of these Zipfian laws that can be extended to other languages. We report the first evidence of two marked regimes for these laws in written language and speech, paralleling the two regimes in Zipf's rank-frequency law discovered in large multi-author corpora in the early 2000s. Finally, the implications of these two regimes are discussed.
3. Sensitivity to Communication Partners During Naturalistic AAC Conversations in Cantonese Chinese. Front Psychol 2021; 12:686657. PMID: 34489796; PMCID: PMC8416610; DOI: 10.3389/fpsyg.2021.686657.
Abstract
Previous studies have shown that graphic-based augmentative and alternative communication (AAC) output tends to be short and simple in structure, with non-canonical word order, and that AAC users may communicate differently with peers than with professionals such as speech therapists (STs). However, graphic-based AAC has rarely been reported on in the Chinese context, and the effect of communication partners has not been investigated systematically. In this study of 34 AAC users and 10 STs, we report features of free conversations in Cantonese graphic-based AAC that are common to, and distinct from, AAC in other languages. We also found that AAC users were sensitive to different types of communication partners. In particular, when conversing with peers, AAC users produced long messages with an equal proportion of questions and responses, suggesting active and bidirectional exchanges. In conversations with STs, AAC users showed high diversity in expressive vocabulary, indicating access to more semantic concepts. The results suggest that the base language and the communication partner are both influential factors that should be considered in studies of graphic-based AAC. The mobile AAC system facilitated free conversations in users with complex communication needs, affording an additional channel for social participation.
4. Phylogeny and mechanisms of shared hierarchical patterns in birdsong. Curr Biol 2021; 31:2796-2808.e9. PMID: 33989526; DOI: 10.1016/j.cub.2021.04.015.
Abstract
Organizational patterns can be shared across biological systems, and revealing the factors shaping common patterns can provide insight into fundamental biological mechanisms. The behavioral pattern that elements with more constituents tend to consist of shorter constituents (Menzerath's law [ML]) was first described in speech and language (e.g., words with more syllables consist of shorter syllables) and subsequently in music and animal communication. Menzerath's law is hypothesized to reflect efficiency in information transfer, but biases and constraints in motor production can also lead to this pattern. We investigated the evolutionary breadth of ML and the contribution of production mechanisms to ML in the songs of 15 songbird species. Negative relationships between the number and duration of constituents (e.g., syllables in phrases) were observed in all 15 species. However, negative relationships were also observed in null models in which constituents were randomly allocated into observed element durations, and for numerous species the observed negative relationship did not differ from the null model; consequently, ML in these species could simply reflect production constraints rather than communicative efficiency. By contrast, ML was significantly different from the null model in more than half the cases, suggesting that additional organizational rules are imposed on birdsongs. Production mechanisms are also underscored by the finding that canaries and zebra finches reared without the auditory experiences that guide vocal development produced songs with ML patterning nearly identical to that of typically reared birds. These analyses highlight the breadth with which production mechanisms contribute to this prevalent organizational pattern in behavior.
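The core comparison in this abstract (an observed negative number-duration slope versus a randomized null) can be sketched as follows. This is a simplified illustration on made-up data, not the authors' exact null model: here the null pools all constituent durations and deals them back into phrases of the observed sizes, destroying any link between constituent count and duration.

```python
import random

def menzerath_slope(phrases):
    """Least-squares slope of mean constituent (e.g., syllable) duration
    against constituent count; Menzerath's law predicts a negative slope."""
    pts = [(len(p), sum(p) / len(p)) for p in phrases]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    return (sum((x - mx) * (y - my) for x, y in pts)
            / sum((x - mx) ** 2 for x, _ in pts))

def null_slopes(phrases, trials=200, seed=0):
    """Permutation null: pool all durations, shuffle, and redistribute them
    into phrases of the original sizes, then recompute the slope."""
    rng = random.Random(seed)
    sizes = [len(p) for p in phrases]
    pool = [d for p in phrases for d in p]
    out = []
    for _ in range(trials):
        rng.shuffle(pool)
        it = iter(pool)
        shuffled = [[next(it) for _ in range(k)] for k in sizes]
        out.append(menzerath_slope(shuffled))
    return out

# Toy data (seconds) already exhibiting Menzerath's law
phrases = [[0.30], [0.20, 0.20], [0.10, 0.10, 0.10]]
obs = menzerath_slope(phrases)           # negative
null = null_slopes(phrases, trials=200)  # centered near zero
```

An observed slope well outside the null distribution is what the study treats as evidence of organization beyond production constraints.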
5.
Abstract
Language is a result of brain function; thus, impairment in cognitive function can result in language disorders. Understanding the aging of brain functions in terms of language processing is crucial for modern aging societies. Previous studies have shown that language characteristics, such as verbal fluency, are associated with cognitive functions. However, the scaling laws in language in elderly people remain poorly understood. In the current study, we recorded large-scale data of one million words from group conversations among healthy elderly people and analyzed the relationship between spoken language and cognitive functions in terms of scaling laws, namely, Zipf's law and Heaps' law. We found that word patterns followed these scaling laws irrespective of cognitive function, and that the variations in Heaps' exponents were associated with cognitive function. Moreover, variations in Heaps' exponents were associated with the ratio of new words taken from the other participants' speech. These results indicate that the exponents of scaling laws in language are related to cognitive processes.
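Heaps' law relates vocabulary size V(n) to the number of tokens n via V(n) ≈ n^β. A minimal, hedged sketch of how such an exponent can be estimated (log-log regression over the vocabulary-growth curve; the study's actual fitting procedure may differ):

```python
import math

def heaps_exponent(tokens):
    """Estimate beta in V(n) ~ n^beta by least squares on log n vs. log V(n),
    where V(n) is the number of distinct tokens among the first n."""
    seen, pts = set(), []
    for n, w in enumerate(tokens, 1):
        seen.add(w)
        pts.append((math.log(n), math.log(len(seen))))
    pts = pts[1:]  # drop n = 1, where both logs are zero
    k = len(pts)
    mx = sum(x for x, _ in pts) / k
    my = sum(y for _, y in pts) / k
    return (sum((x - mx) * (y - my) for x, y in pts)
            / sum((x - mx) ** 2 for x, _ in pts))

beta_new = heaps_exponent([f"w{i}" for i in range(100)])  # every token new
beta_rep = heaps_exponent(["a"] * 100)                    # one word repeated
```

The two extreme cases bracket real speech: all-new tokens give β = 1, endless repetition gives β = 0, and conversational transcripts fall in between.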
6. Distinct flavors of Zipf's law and its maximum likelihood fitting: Rank-size and size-distribution representations. Phys Rev E 2020; 102:052113. PMID: 33327144; DOI: 10.1103/PhysRevE.102.052113.
Abstract
In recent years, researchers have realized the difficulties of fitting power-law distributions properly. These difficulties are higher in Zipfian systems, due to the discreteness of the variables and to the existence of two representations for these systems, i.e., two versions depending on the random variable to fit: rank or size. The discreteness implies that a power law in one of the representations is not a power law in the other, and vice versa. We generate synthetic power laws in both representations and apply a state-of-the-art fitting method to each of the two random variables. The method (based on maximum likelihood plus a goodness-of-fit test) does not fit the whole distribution but the tail, understood as the part of a distribution above a cutoff that separates non-power-law behavior from power-law behavior. We find that, no matter which random variable is power-law distributed, using the rank as the random variable is problematic for fitting, in general (although it may work in some limit cases). One of the difficulties comes from recovering the "hidden" true ranks from the empirical ranks. On the contrary, the representation in terms of the distribution of sizes allows one to recover the true exponent (with some small bias when the underlying size distribution is a power law only asymptotically).
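For the size-distribution representation, the continuous maximum-likelihood (Hill) estimator below illustrates fitting a power-law tail above a cutoff xmin. This is a hedged sketch only: the paper's method handles the discrete case and couples the fit to a goodness-of-fit test, both omitted here.

```python
import math
import random

def hill_mle(sizes, xmin):
    """Continuous MLE of alpha in p(x) ~ x^(-alpha) for the tail x >= xmin."""
    tail = [x for x in sizes if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Synthetic continuous power law with alpha = 2.5 via inverse-transform sampling
rng = random.Random(42)
alpha = 2.5
data = [(1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(20000)]
alpha_hat = hill_mle(data, 1.0)  # should land near 2.5
```

Applying the same estimator to empirical ranks rather than sizes is exactly the pitfall the paper warns about: the "hidden" true ranks are not recoverable from the observed ones.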
7. From Boltzmann to Zipf through Shannon and Jaynes. Entropy 2020; 22:e22020179. PMID: 33285954; PMCID: PMC7516604; DOI: 10.3390/e22020179.
Abstract
The word-frequency distribution provides the fundamental building blocks that generate discourse in natural language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf's law, at least approximately. Following Stephens and Bialek (2010), we interpret the frequency of any word as arising from the interaction potentials between its constituent letters. Indeed, Jaynes' maximum-entropy principle, with the constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of the all-to-all pairwise (two-letter) potentials. The so-called improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. We considerably extend Stephens and Bialek's results, applying this formalism to words with lengths of up to six letters from the English subset of the recently created Standardized Project Gutenberg Corpus. We find that the model is able to reproduce Zipf's law, but with some limitations: the general Zipf's power-law regime is obtained, but the probability of individual words shows considerable scattering. In this way, a pure statistical-physics framework is used to describe the probabilities of words. As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws.
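The energy-based word model described above can be sketched in miniature. This is a toy illustration with an assumed two-letter alphabet and hand-picked potentials, not the paper's fitted model; the improved iterative-scaling step that learns the potentials from the empirical marginals is omitted.

```python
import math
from itertools import product

def word_probabilities(potentials, alphabet, length):
    """Boltzmann word model: E(w) is the sum of all-to-all pairwise letter
    potentials, and p(w) ~ exp(-E(w)). `potentials` maps sorted letter
    pairs to energies; absent pairs cost nothing."""
    def energy(w):
        return sum(potentials.get(tuple(sorted((a, b))), 0.0)
                   for i, a in enumerate(w) for b in w[i + 1:])
    words = ["".join(t) for t in product(alphabet, repeat=length)]
    weights = [math.exp(-energy(w)) for w in words]
    z = sum(weights)  # partition function
    return {w: wt / z for w, wt in zip(words, weights)}

uniform = word_probabilities({}, "ab", 2)                  # no interactions
repelled = word_probabilities({("a", "b"): 1.0}, "ab", 2)  # 'a'-'b' penalized
```

With zero potentials all words are equiprobable; a positive pair potential makes words containing that letter pair correspondingly rarer, which is the mechanism the paper scales up to six-letter English words.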
8.
Abstract
In this work we consider Glissando Corpus—an oral corpus of Catalan and Spanish—and empirically analyze the presence of the four classical linguistic laws (Zipf’s law, Herdan’s law, Brevity law, and Menzerath–Altmann’s law) in oral communication, and further complement this with the analysis of two recently formulated laws: lognormality law and size-rank law. By aligning the acoustic signal of speech production with the speech transcriptions, we are able to measure and compare the agreement of each of these laws when measured in both physical and symbolic units. Our results show that these six laws are recovered in both languages but considerably more emphatically so when these are examined in physical units, hence reinforcing the so-called ‘physical hypothesis’ according to which linguistic laws might indeed have a physical origin and the patterns recovered in written texts would, therefore, be just a byproduct of the regularities already present in the acoustic signals of oral communication.
9. Predicting the performance of TV series through textual and network analysis: The case of Big Bang Theory. PLoS One 2019; 14:e0225306. PMID: 31751391; PMCID: PMC6874063; DOI: 10.1371/journal.pone.0225306.
Abstract
TV series represent a growing sector of the entertainment industry. Being able to predict their performance allows a broadcasting network to better focus the high investment needed for their preparation. In this paper, we consider a well-known TV series, The Big Bang Theory, to identify factors leading to its success. The factors considered are mostly related to the script, such as the characteristics of dialogues (e.g., length, language complexity, sentiment), while performance is measured by the reviews submitted by viewers (the number of reviews as a measure of popularity and the viewers' ratings as a measure of appreciation). Through correlation and regression analysis, two sets of predictors are identified for popularity and appreciation. In particular, the episode number, the percentage of male viewers, the language complexity, and the text length emerge as the best predictors of popularity, while the percentage of male viewers and the language complexity, together with the number of we-words and the concentration of dialogues, are the best choice for appreciation.
10. On the physical origin of linguistic laws and lognormality in speech. R Soc Open Sci 2019; 6:191023. PMID: 31598263; PMCID: PMC6731709; DOI: 10.1098/rsos.191023.
Abstract
Physical manifestations of linguistic units include sources of variability due to factors of speech production which are by definition excluded from counts of linguistic symbols. In this work, we examine whether linguistic laws hold with respect to the physical manifestations of linguistic units in spoken English. The data we analyse come from a phonetically transcribed database of acoustic recordings of spontaneous speech known as the Buckeye Speech corpus. First, we verify with unprecedented accuracy that acoustically transcribed durations of linguistic units at several scales comply with a lognormal distribution, and we quantitatively justify this 'lognormality law' using a stochastic generative model. Second, we explore the four classical linguistic laws (Zipf's law, Herdan's law, brevity law, and Menzerath-Altmann's law (MAL)) in oral communication, both in physical units and in symbolic units measured in the speech transcriptions, and find that the validity of these laws is typically stronger when using physical units than in their symbolic counterparts. Additional results include (i) coining a Herdan's law in physical units, (ii) a precise mathematical formulation of the brevity law, which we show to be connected to optimal compression principles in information theory and which allows us to formulate and validate yet another law, the size-rank law, and (iii) a mathematical derivation of MAL which also highlights an additional regime where the law is inverted. Altogether, these results support the hypothesis that statistical laws in language have a physical origin.
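The 'lognormality law' for unit durations is a two-parameter fit, and the maximum-likelihood estimates of a lognormal are simply the mean and standard deviation of the log-durations. A minimal sketch (the paper additionally validates the fit with a stochastic generative model, not reproduced here):

```python
import math

def lognormal_mle(durations):
    """MLE of lognormal parameters: mu and sigma are the mean and (population)
    standard deviation of the log-transformed durations."""
    logs = [math.log(d) for d in durations]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    return mu, sigma

# Durations whose logs are -1, 0, 1: mu = 0, sigma = sqrt(2/3)
mu, sigma = lognormal_mle([math.exp(-1), 1.0, math.exp(1)])
```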
12. Long-Range Correlation Underlying Childhood Language and Generative Models. Front Psychol 2018; 9:1725. PMID: 30283378; PMCID: PMC6157415; DOI: 10.3389/fpsyg.2018.01725.
Abstract
Long-range correlation, a property of time series exhibiting relevant statistical dependence between two distant subsequences, is mainly studied in the statistical-physics domain and has been reported to exist in natural language. Using a state-of-the-art method for such analysis, long-range correlation is first shown to occur in long CHILDES data sets. To understand why, generative stochastic models of language, originally proposed in the cognitive-science domain, are investigated. Among representative models, the Simon model is found to exhibit surprisingly good long-range correlation, whereas the Pitman-Yor model does not. Because the Simon model is known not to correctly reflect the vocabulary growth of natural languages, a simple new model is devised as a combination of the Simon and Pitman-Yor models, such that long-range correlation holds with a correct vocabulary growth rate. Overall, the investigation suggests that uniform sampling is one cause of long-range correlation and could thus have some relation to actual linguistic processes.
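The Simon model mentioned above is easy to state: each new token is a brand-new word with probability a, and otherwise repeats a uniformly chosen earlier token, so frequent words get reused ever more often. A minimal sketch (illustrative; the paper's combined Simon/Pitman-Yor model is not reproduced here):

```python
import random

def simon_model(n_tokens, a, seed=0):
    """Generate a token sequence from the Simon process: with probability `a`
    emit a new word; otherwise copy a token drawn uniformly from the past."""
    rng = random.Random(seed)
    seq = [0]
    next_word = 1
    for _ in range(n_tokens - 1):
        if rng.random() < a:
            seq.append(next_word)
            next_word += 1
        else:
            seq.append(rng.choice(seq))
    return seq

seq = simon_model(5000, 0.1, seed=1)
vocab = len(set(seq))  # grows roughly linearly, about a * n
```

The linear vocabulary growth visible here is exactly the mismatch with natural language (where growth is sublinear, per Heaps' law) that motivates the paper's modified model.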
13.
Abstract
What is the nature of language? How has it evolved in different species? Are there qualitative, well-defined classes of languages? Most studies of language evolution deal in one way or another with such theoretical constructs and explore the outcome of diverse forms of selection on a communication matrix that somehow optimizes communication. This framework naturally introduces networks mediating the communicating agents, but no systematic analysis of the underlying landscape of possible language graphs has been developed. Here we present a detailed analysis of network properties on a generic model of a communication code, which reveals a rather complex and heterogeneous morphospace of language graphs. Additionally, we use curated data of English words to locate and evaluate real languages within this morphospace. Our findings indicate a surprisingly simple structure in human language unless particles able to name any other concept are introduced into the vocabulary. These results refine, and for the first time complement with empirical data, a lasting theoretical tradition built around the framework of least-effort language.
14.
Abstract
Linguistic laws constitute one of the quantitative cornerstones of modern cognitive science and have been routinely investigated in written corpora, or in the equivalent transcriptions of oral corpora. This means that inferences about statistical patterns of language in acoustics are biased by the arbitrary, language-dependent segmentation of the signal, which virtually precludes comparative studies between the human voice and other animal communication systems. Here we bridge this gap by proposing a method that allows such patterns to be measured in acoustic signals of arbitrary origin, without access to the underlying language corpus. The method has been applied to sixteen different human languages, successfully recovering some well-known laws of human communication at timescales even below the phoneme and finding yet another link between complexity and criticality in a biological system. These methods further pave the way for new comparative studies in animal communication and for the analysis of signals of unknown code.
16. Zipf's Law: Balancing Signal Usage Cost and Communication Efficiency. PLoS One 2015; 10:e0139475. PMID: 26427059; PMCID: PMC4591018; DOI: 10.1371/journal.pone.0139475.
Abstract
We propose a model that explains the reliable emergence of power laws (e.g., Zipf's law) during the development of different human languages. The model incorporates the principle of least effort in communications, minimizing a combination of the information-theoretic communication inefficiency and direct signal cost. We prove a general relationship, for all optimal languages, between the signal cost distribution and the resulting distribution of signals. Zipf's law then emerges for logarithmic signal cost distributions, which is the cost distribution expected for words constructed from letters or phonemes.
17.
Abstract
We present substantial evidence that word length can be an essential lexical structural feature in word evolution in written Chinese. The data used in this study are diachronic Chinese short narrative texts spanning over 2000 years. We show that the increase of word length is an essential regularity in word evolution. On the one hand, word frequency is found to depend on word length, and their relation is in line with the power-law function y = ax^(-b). On the other hand, our deeper analyses show that the increase of word length results in the simplification of characters, for balance, in written Chinese. Moreover, the correspondence between written and spoken Chinese is discussed. We conclude that the disyllabic trend may account for the increase of word length, and that its impacts can be explained by the principle of least effort.
18.
Abstract
Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems have proved useful for creating several language models. Despite the large number of studies devoted to representing texts with physical models, only a few have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex-network methods that improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here can be straightforwardly applied to similar textual applications where topology plays a pivotal role in the description of the interacting agents.
19. Zipf's Law for Word Frequencies: Word Forms versus Lemmas in Long Texts. PLoS One 2015; 10:e0129031. PMID: 26158787; PMCID: PMC4497678; DOI: 10.1371/journal.pone.0129031.
Abstract
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
20. Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms. PLoS One 2015; 10:e0128254. PMID: 26083380; PMCID: PMC4470635; DOI: 10.1371/journal.pone.0128254.
Abstract
Explaining the diversity of languages across the world is one of the central aims of typological, historical, and evolutionary linguistics. We consider the effect of language contact (the number of non-native speakers a language has) on the way languages change and evolve. By analysing hundreds of languages within and across language families, regions, and text types, we show that languages with greater levels of contact typically employ fewer word forms to encode the same information content (a property we refer to as lexical diversity). Based on three types of statistical analyses, we demonstrate that this variance can in part be explained by the impact of non-native speakers on information encoding strategies. Finally, we argue that languages are information encoding systems shaped by the varying needs of their speakers. Language evolution and change should be modeled as the co-evolution of multiple intertwined adaptive systems: on one hand, the structure of human societies and human learning capabilities, and on the other, the structure of language.
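The "fewer word forms for the same content" effect can be approximated crudely with a type-token ratio. This is a hedged proxy only; the paper's lexical-diversity measures, based on information encoding across matched corpora, are more involved.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens: a crude proxy for the
    lexical diversity discussed above (it is length-sensitive, so texts
    should be compared at equal token counts)."""
    return len(set(tokens)) / len(tokens)

high = type_token_ratio("the cat sat on a mat".split())          # all distinct
low = type_token_ratio("the dog and the cat and the rat".split())
```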
21. Organic Chemistry as a Language and the Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses. Angew Chem Int Ed Engl 2014. DOI: 10.1002/ange.201403708.
22. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew Chem Int Ed Engl 2014; 53:8108-12. PMID: 25044611; DOI: 10.1002/anie.201403708.
Abstract
Methods of computational linguistics are used to demonstrate that a natural language such as English and organic chemistry have the same structure in terms of the frequency of, respectively, text fragments and molecular fragments. This quantitative correspondence suggests that it is possible to extend the methods of computational corpus linguistics to the analysis of organic molecules. It is shown that within organic molecules the bonds with the highest information content are the ones that (1) define repeat/symmetry subunits and (2) in asymmetric molecules, define the loci of potential retrosynthetic disconnections. Linguistics-based analysis appears well suited to the analysis of complex structural and reactivity patterns within organic molecules.
23. A common construction pattern of English words and Chinese characters. PLoS One 2013; 8:e74515. PMID: 24023946; PMCID: PMC3759465; DOI: 10.1371/journal.pone.0074515.
Abstract
Rankings are ubiquitous around the world. Here I investigate spatial ranking patterns of English words and Chinese characters, and reveal a common construction pattern related to phase separation. In detail, I analyze a list of different words in the English language, and find that the frequency of the number of letters per word decays linearly or nonlinearly over its rank in the frequency table. I interpret the linearly decaying area as a linear phase that covers 96.4% of words, in sharp contrast to a nonlinear phase (the nonlinearly decaying area) that covers the remaining 3.6% of words. Remarkably, the phase-separation phenomenon, with the same two percentages of 96.4% and 3.6%, also holds for the relation between strokes and characters in the Chinese language, although English and Chinese are two distinctly different language systems. The common construction pattern originates from the log-normal distributions of frequencies of words or characters, which can be understood by the joint effect of the Weber-Fechner law in psychophysics and the principle of maximum entropy in information theory.