1
Di Marco N, Loru E, Bonetti A, Serra AOG, Cinelli M, Quattrociocchi W. Patterns of linguistic simplification on social media platforms over time. Proc Natl Acad Sci U S A 2024; 121:e2412105121. [PMID: 39642198 DOI: 10.1073/pnas.2412105121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Accepted: 11/07/2024] [Indexed: 12/08/2024] Open
Abstract
Understanding the impact of digital platforms on user behavior presents foundational challenges, including issues related to polarization, misinformation dynamics, and variation in news consumption. Comparative analyses across platforms and over different years can provide critical insights into these phenomena. This study investigates the linguistic characteristics of user comments over 34 years, focusing on their complexity and temporal shifts. Using a dataset of approximately 300 million English comments from eight diverse platforms and topics, we examine user communications' vocabulary size and linguistic richness and their evolution over time. Our findings reveal consistent patterns of complexity across social media platforms and topics, characterized by a nearly universal reduction in text length, diminished lexical richness, and decreased repetitiveness. Despite these trends, users consistently introduce new words into their comments at a nearly constant rate. This analysis underscores that platforms only partially influence the complexity of user comments, which instead reflects a broader pattern of linguistic change driven by social triggers, suggesting intrinsic tendencies in users' online interactions comparable to historically recognized linguistic hybridization and contamination processes.
Affiliation(s)
- N Di Marco
- Department of Computer Science, Sapienza University of Rome, Roma 00161, Italy
- Edoardo Loru
- Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome 00185, Italy
- Anita Bonetti
- Department of Communication and Social Research, Roma 00198, Italy
- Alessandra Olga Grazia Serra
- Tuscia University - Dipartimento di studi linguistico-letterari, storico-filosofici e giuridici (DISTU) Department of Modern Languages and Literatures, History, Philosophy and Law Studies, Viterbo 01100, Italy
- Matteo Cinelli
- Department of Computer Science, Sapienza University of Rome, Roma 00161, Italy
2
Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' or Herdan-Heaps' law is a linguistic law describing the relationship between the vocabulary/dictionary size (type) and word counts (token) as a power-law function. Its existence in genomes with a certain definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context is different from other definitions such as over-represented k-mers, which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to Heaps' law in DNA words is not automatic. Several other animal genomes are shown herein also to exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reached the maximum level, a deviation from Heaps' law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigation of the type-token plot and its regression coefficients could provide an alternative narrative of reusage and redundancy of protein domains as well as creation of new protein domains from a linguistic perspective.
Affiliation(s)
- Wentian Li
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
- Yannis Almirantis
- Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
- Astero Provata
- Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
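
As a rough illustration of the type-token relationship that the abstract above calls Heaps' law, V(N) ≈ K·N^β, the following sketch builds a type-token curve and estimates β with an ordinary least-squares fit in log-log space. The toy token sequence and the function names `type_token_curve` and `fit_heaps` are assumptions made for this example; this is not the authors' procedure, and their DNA "words" are protein-domain coding regions rather than text tokens.

```python
import math

def type_token_curve(tokens):
    """Return (number of tokens read, number of distinct types) after each token."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((i, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(my - beta * mx)
    return K, beta

# Hypothetical toy sequence; real data would be much longer.
tokens = "a b a c b d a e c f a b g a h".split()
curve = type_token_curve(tokens)
K, beta = fit_heaps(curve)
print(f"K = {K:.3f}, beta = {beta:.3f}")
```

A sublinear exponent (β below 1) means new types appear more slowly as the sample grows. A log-log regression of this kind is only a first check; it does not by itself establish that a power law holds over the whole range.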
3
Pilgrim C, Guo W, Hills TT. The rising entropy of English in the attention economy. Communications Psychology 2024; 2:70. [PMID: 39242771 PMCID: PMC11332035 DOI: 10.1038/s44271-024-00117-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Accepted: 07/11/2024] [Indexed: 09/09/2024]
Abstract
We present evidence that the word entropy of American English has been rising steadily since around 1900. We also find differences in word entropy between media categories, with short-form media such as news and magazines having higher entropy than long-form media, and social media feeds having higher entropy still. To explain these results we develop an ecological model of the attention economy that combines ideas from Zipf's law and information foraging. In this model, media consumers maximize information utility rate taking into account the costs of information search, while media producers adapt to technologies that reduce search costs, driving them to generate higher entropy content in increasingly shorter formats.
Affiliation(s)
- Charlie Pilgrim
- Mathematics, University of Leeds, Leeds, UK.
- The Mathematics of Real-World Systems CDT, The University of Warwick, Coventry, UK.
- Experimental Psychology, University College London, London, UK.
- The Alan Turing Institute, London, UK.
- Weisi Guo
- The Alan Turing Institute, London, UK
- Human Machine Intelligence Group, Cranfield University, Bedford, UK
- Thomas T Hills
- The Alan Turing Institute, London, UK
- Department of Psychology, The University of Warwick, Coventry, UK
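
The word entropy mentioned in the abstract above can be illustrated with the plain unigram Shannon entropy, H = -Σ p(w) log2 p(w), computed from word counts. This is a minimal sketch under that assumption; the function name `word_entropy` and the toy inputs are hypothetical, and the paper's own estimator (for example, any correction for sample size) may differ.

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Unigram Shannon entropy in bits per word, from plug-in frequency estimates."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy inputs: a uniform mix of words versus one repeated word.
varied = ["news", "feed", "post", "link"]
repetitive = ["news", "news", "news", "news"]
print(word_entropy(varied), word_entropy(repetitive))
```

Text that spreads its probability mass over more distinct words scores higher, which is the sense in which shorter, more varied formats can have higher entropy. The plug-in estimate is biased for small samples, so texts of different lengths should not be compared without controlling for sample size.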
4
Enfield NJ. Scale in Language. Cogn Sci 2023; 47:e13341. [PMID: 37823747 DOI: 10.1111/cogs.13341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 08/28/2023] [Accepted: 08/30/2023] [Indexed: 10/13/2023]
Abstract
A central concern of the cognitive science of language since its origins has been the concept of the linguistic system. Recent approaches to the system concept in language point to the exceedingly complex relations that hold between many kinds of interdependent systems, but it can be difficult to know how to proceed when "everything is connected." This paper offers a framework for tackling that challenge by identifying *scale* as a conceptual mooring for the interdisciplinary study of language systems. The paper begins by defining the scale concept: simply, the possibility for a measure to be larger or smaller in different instances of a system, such as a phonemic inventory, a word's frequency value in a corpus, or a speaker population. We review sites of scale difference in and across linguistic subsystems, drawing on findings from linguistic typology, grammatical description, morphosyntactic theory, psycholinguistics, computational corpus work, and social network demography. We consider possible explanations for scaling differences and constraints in language. We then turn to the question of *dependencies between* sites of scale difference in language, reviewing four sample domains of scale dependency: in phonological systems, across levels of grammatical structure (Menzerath's Law), in corpora (Zipf's Law and related issues), and in speaker population size. Finally, we consider the implications of the review, including the utility of a scale framework for generating new questions and inspiring methodological innovations and interdisciplinary collaborations in cognitive-scientific research on language.
Affiliation(s)
- N J Enfield
- Discipline of Linguistics, The University of Sydney
5
Kim H, Park S, Jeong M, Byun H, Kim J, Lee DY, Jeon J, Yi E, Ahn K. Scaling behavior and text cohesion in Korean texts. PLoS One 2023; 18:e0290168. [PMID: 37651361 PMCID: PMC10470962 DOI: 10.1371/journal.pone.0290168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2022] [Accepted: 08/02/2023] [Indexed: 09/02/2023] Open
Abstract
This study examines whether different types of texts, particularly in Korean, can be distinguished by the scaling exponent and degree of text cohesion. We use the controlled growth process model to incorporate the interaction effect into a power-law distribution and estimate the implied parameter explaining the degree of text cohesiveness in a word distribution. We find that the word distributions of Korean languages differ from English regarding the range of scaling exponents. Additionally, different types of Korean texts display similar scaling exponents regardless of their genre. However, the interaction effect is higher for expert reports than for the benchmark novels. The findings suggest a valid framework for explaining the scaling phenomena of word distribution based on microscale interactions. It also suggests that a viable method exists for inferring text genres based on text cohesion.
Affiliation(s)
- Hokyun Kim
- Korea Advanced Institute of Science and Technology, Moon Soul Graduate School of Future Strategy, Daejeon, South Korea
- Sanghu Park
- Department of Industrial Engineering and Center for Finance and Technology, Yonsei University, Seoul, South Korea
- Minhyuk Jeong
- Department of Industrial Engineering and Center for Finance and Technology, Yonsei University, Seoul, South Korea
- Hyungi Byun
- FNC Technology Co., Ltd., Gyeonggi-do, South Korea
- Juyub Kim
- Department of Industrial Engineering and Center for Finance and Technology, Yonsei University, Seoul, South Korea
- FNC Technology Co., Ltd., Gyeonggi-do, South Korea
- Doo Yong Lee
- FNC Technology Co., Ltd., Gyeonggi-do, South Korea
- Jooyoung Jeon
- Korea Advanced Institute of Science and Technology, Moon Soul Graduate School of Future Strategy, Daejeon, South Korea
- Eojin Yi
- Seoul Business School, aSSIST University, Seoul, South Korea
- Kwangwon Ahn
- Department of Industrial Engineering and Center for Finance and Technology, Yonsei University, Seoul, South Korea
6
Budel G, Jin Y, Van Mieghem P, Kitsak M. Topological properties and organizing principles of semantic networks. Sci Rep 2023; 13:11728. [PMID: 37474614 PMCID: PMC10359341 DOI: 10.1038/s41598-023-37294-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 06/19/2023] [Indexed: 07/22/2023] Open
Abstract
Interpreting natural language is an increasingly important task in computer algorithms due to the growing availability of unstructured textual data. Natural Language Processing (NLP) applications rely on semantic networks for structured knowledge representation. The fundamental properties of semantic networks must be taken into account when designing NLP algorithms, yet they remain to be structurally investigated. We study the properties of semantic networks from ConceptNet, defined by 7 semantic relations from 11 different languages. We find that semantic networks have universal basic properties: they are sparse, highly clustered, and many exhibit power-law degree distributions. Our findings show that the majority of the considered networks are scale-free. Some networks exhibit language-specific properties determined by grammatical rules; for example, networks from highly inflected languages, such as Latin, German, French and Spanish, show peaks in the degree distribution that deviate from a power law. We find that, depending on the semantic relation type and the language, the link formation in semantic networks is guided by different principles. In some networks the connections are similarity-based, while in others the connections are more complementarity-based. Finally, we demonstrate how knowledge of similarity and complementarity in semantic networks can improve NLP algorithms in missing link inference.
Affiliation(s)
- Gabriel Budel
- Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2628 CD, Delft, The Netherlands
- Ying Jin
- Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2628 CD, Delft, The Netherlands
- Piet Van Mieghem
- Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2628 CD, Delft, The Netherlands
- Maksim Kitsak
- Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2628 CD, Delft, The Netherlands.
7
Staples TL. Expansion and evolution of the R programming language. Royal Society Open Science 2023; 10:221550. [PMID: 37063989 PMCID: PMC10090872 DOI: 10.1098/rsos.221550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 03/23/2023] [Indexed: 06/19/2023]
Abstract
Languages change over time, driven by creation of new words and cultural pressure to optimize communication. Programming languages resemble written language but communicate primarily with computer hardware rather than a human audience. I tested whether there were detectable changes over time in use of R, a mature, open-source programming language used for scientific computing. Across 393 142 GitHub repositories published between 2014 and 2021, I extracted 143 409 288 R functions, programming 'verbs', pairing linguistic and ecological analyses to detect change to diversity and composition of functions used over time. I found the number of R functions in use increased and underwent substantial change, driven primarily by the popularity of the 'tidyverse' collection of community-written extensions. I provide evidence that users can change the nature of programming languages, with patterns that match known processes from natural languages and genetic evolution. In R, there appear to be selective pressures for increased analytic complexity and R functions in decline that are not yet extinct (extinction debts). R's evolution towards the tidyverse may also represent the start of a division into two distinct dialects, which may impact the readability and continuity of analytic and scientific inquiries codified in R, as well as the language's future.
Affiliation(s)
- Timothy L. Staples
- School of Biological Sciences, The University of Queensland, Building 60, St Lucia, Queensland 4072, Australia
8
Lavi-Rotbain O, Arnon I. Zipfian Distributions in Child-Directed Speech. Open Mind (Camb) 2023; 7:1-30. [PMID: 36891353 PMCID: PMC9987348 DOI: 10.1162/opmi_a_00070] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/15/2022] [Accepted: 11/30/2022] [Indexed: 12/23/2022] Open
Abstract
Across languages, word frequency and rank follow a power law relation, forming a distribution known as the Zipfian distribution. There is growing experimental evidence that this well-studied phenomenon may be beneficial for language learning. However, most investigations of word distributions in natural language have focused on adult-to-adult speech: Zipf's law has not been thoroughly evaluated in child-directed speech (CDS) across languages. If Zipfian distributions facilitate learning, they should also be found in CDS. At the same time, several unique properties of CDS may result in a less skewed distribution. Here, we examine the frequency distribution of words in CDS in three studies. We first show that CDS is Zipfian across 15 languages from seven language families. We then show that CDS is Zipfian from early on (six-months) and across development for five languages with sufficient longitudinal data. Finally, we show that the distribution holds across different parts of speech: Nouns, verbs, adjectives and prepositions follow a Zipfian distribution. Together, the results show that the input children hear is skewed in a particular way from early on, providing necessary (but not sufficient) support for the postulated learning advantage of such skew. They highlight the need to study skewed learning environments experimentally.
Affiliation(s)
- Ori Lavi-Rotbain
- The Edmond and Lilly Safra Center for Brain Sciences, Hebrew University, Jerusalem, Israel
- Inbal Arnon
- Department of Psychology, Hebrew University, Jerusalem, Israel
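
The power-law relation between word frequency and rank described in the abstract above, f(r) ∝ r^(-α), can be checked crudely by regressing log frequency on log rank. The sketch below is an assumption-laden illustration, not the authors' analysis: the synthetic corpus, the names `rank_frequency` and `zipf_exponent`, and the least-squares estimator are all hypothetical choices, and maximum-likelihood fitting is generally preferred for real data.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)

def zipf_exponent(freqs):
    """Crude least-squares slope of log f against log r; returns alpha in f ~ r^-alpha."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic corpus (hypothetical): word w_r occurs about 120 / r times.
tokens = [f"w{r}" for r in range(1, 11) for _ in range(120 // r)]
freqs = rank_frequency(tokens)
alpha = zipf_exponent(freqs)
print(freqs, round(alpha, 3))
```

On real child-directed speech the token list would come from a tokenized transcript, and the fit should be checked over a range of ranks rather than summarized by a single slope.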
9
Holdaway C, Piantadosi ST. Stochastic Time-Series Analyses Highlight the Day-To-Day Dynamics of Lexical Frequencies. Cogn Sci 2022; 46:e13215. [PMID: 36515373 DOI: 10.1111/cogs.13215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Revised: 08/25/2022] [Accepted: 10/09/2022] [Indexed: 12/15/2022]
Abstract
Standard models in quantitative linguistics assume that word usage follows a fixed frequency distribution, often Zipf's law or a close relative. This view, however, does not capture the near daily variations in topics of conversation, nor the short-term dynamics of language change. In order to understand the dynamics of human language use, we present a corpus of daily word frequency variation scraped from online news sources every 20 min for more than 2 years. We construct a simple time-varying model with a latent state, which is observed via word frequency counts. We use Bayesian techniques to infer the parameters of this model for 20,000 words, allowing us to convert complex word-frequency trajectories into low-dimensional parameters in word usage. By analyzing the inferred parameters of this model, we quantify the relative mobility and drift of words on a day-to-day basis, while accounting for sampling error. We quantify this variation and show evidence against "rich-get-richer" models of word use, which have been previously hypothesized to explain statistical patterns in language.
10
The Evolution of Sustainability Ideas in China from 1946 to 2015, Quantified by Culturomics. Sustainability 2022. [DOI: 10.3390/su14106038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
Economy and ecology are two main aspects of human sustainable development. However, a comprehensive analysis of the status and trends of economic and ecological cognition is still lacking. Here, we defined economic and ecological concepts as cultural traits that constitute a complex system representing sustainability ideas. Adopting a linguistic ecology perspective, we analysed the frequency distribution, turnover and innovation rates of 3713 concepts appearing in China’s mainstream newspaper, People’s Daily, from 1946 to 2015. Results reveal that: (1) In the whole historical period, there were more economic concepts than ecological concepts both in amount and category. Economic concepts experienced stronger cultural drift than ecological concepts tested by the neutral model of cultural evolution; (2) popular economic concepts became more diversified, but popular ecological concepts became more uniform; (3) both economic concepts and ecological concepts attained more variation in their own disciplinary domains than in cross-disciplinary domains; and (4) as a platform of both giving information and opinion, a newspaper is subjected to cultural selection, especially reflected in the change in ecological concepts under the context of Chinese ecological civilization construction. We concluded with a discussion of promoting vibrant and resilient ecological knowledge in fostering sustainability activities and behaviours.
11
Troumbis AY, Iosifidis S, Kalloniatis C. Uncovering patterns of public perceptions towards biodiversity crime using conservation culturomics. Crime, Law, and Social Change 2022; 78:405-426. [PMID: 35529301 PMCID: PMC9055009 DOI: 10.1007/s10611-022-10028-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/21/2022] [Indexed: 06/14/2023]
Abstract
This paper examines aspects of the relationship between (1) the recently typified form of biodiversity crime, (2) information made available to the public through the Internet, and (3) cultural dynamics quantified through info-surveillance methods through Culturomics techniques. We propose two conceptual models: (1) the building-up process of a biodiversity crime culturome, in some language, and (2) a multi-stage biodiversity conservation chain and biodiversity-crime activities relating to each stage. We use crowd search volumes on the Internet on biodiversity crime-related terms and topics as proxies for measuring public interest. The main findings are: (1) the concept of biodiversity-crime per se is still immature and has low penetration among the general public; (2) biodiversity-crime issues, not recognized as such, are amalgamated in conservation-oriented websites and pages; and (3) differences in perceptions and priorities between the general public and a niche public with particular interest(s) in environmental issues are discernible.
Affiliation(s)
- Andreas Y. Troumbis
- Biodiversity Conservation Laboratory, Department of the Environment, University of the Aegean, 81100 Mytilini, Greece
- Spyridon Iosifidis
- Biodiversity Conservation Laboratory, Department of the Environment, University of the Aegean, 81100 Mytilini, Greece
- Christos Kalloniatis
- Privacy Engineering and Social Informatics Laboratory, Dept. of Cultural Technology and Communication, University of the Aegean, Mitilini, Greece
12
Lognormals, power laws and double power laws in the distribution of frequencies of harmonic codewords from classical music. Sci Rep 2022; 12:2615. [PMID: 35173194 PMCID: PMC8850585 DOI: 10.1038/s41598-022-06137-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/28/2021] [Accepted: 01/21/2022] [Indexed: 11/08/2022] Open
Abstract
Zipf's law is a paradigm describing the importance of different elements in communication systems, especially in linguistics. Despite the complexity of the hierarchical structure of language, music has in some sense an even more complex structure, due to its multidimensional character (melody, harmony, rhythm, timbre, etc.). Thus, the relevance of Zipf's law in music is still an open question. Using discrete codewords representing harmonic content obtained from a large-scale analysis of classical composers, we show that a nearly universal Zipf-like law holds at a qualitative level. However, in an in-depth quantitative analysis, where we introduce the double power-law distribution as a new player in the classical debate between the superiority of Zipf's (power) law and that of the lognormal distribution, we conclude not only that universality does not hold, but also that there is not a unique probability distribution that best describes the usage of the different codewords by each composer.
13
Català N, Baixeries J, Ferrer-i-Cancho R, Padró L, Hernández-Fernández A. Zipf's laws of meaning in Catalan. PLoS One 2021; 16:e0260849. [PMID: 34914766 PMCID: PMC8675765 DOI: 10.1371/journal.pone.0260849] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/27/2021] [Accepted: 11/17/2021] [Indexed: 11/19/2022] Open
Abstract
In his pioneering research, G. K. Zipf formulated a couple of statistical laws on the relationship between the frequency of a word with its number of meanings: the law of meaning distribution, relating the frequency of a word and its frequency rank, and the meaning-frequency law, relating the frequency of a word with its number of meanings. Although these laws were formulated more than half a century ago, they have been only investigated in a few languages. Here we present the first study of these laws in Catalan. We verify these laws in Catalan via the relationship among their exponents and that of the rank-frequency law. We present a new protocol for the analysis of these Zipfian laws that can be extended to other languages. We report the first evidence of two marked regimes for these laws in written language and speech, paralleling the two regimes in Zipf's rank-frequency law in large multi-author corpora discovered in early 2000s. Finally, the implications of these two regimes will be discussed.
Affiliation(s)
- Neus Català
- TALP Research Center, Computer Science Departament, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Jaume Baixeries
- LARCA Research Group, Complexity and Quantitative Linguistics Laboratory, Computer Science Departament, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Ramon Ferrer-i-Cancho
- LARCA Research Group, Complexity and Quantitative Linguistics Laboratory, Computer Science Departament, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Lluís Padró
- TALP Research Center, Computer Science Departament, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Antoni Hernández-Fernández
- LARCA Research Group, Complexity and Quantitative Linguistics Laboratory, Computer Science Departament, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Societat Catalana de Tecnologia, Secció de Ciències i Tecnologia, Institut d’Estudis Catalans - Catalan Studies Institute, Barcelona, Catalonia, Spain
14
Aletti G, Crimaldi I. Twitter as an innovation process with damping effect. Sci Rep 2021; 11:21243. [PMID: 34711859 PMCID: PMC8553952 DOI: 10.1038/s41598-021-00378-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Accepted: 10/11/2021] [Indexed: 11/23/2022] Open
Abstract
In the existing literature about innovation processes, the proposed models often satisfy Heaps' law, regarding the rate at which novelties appear, and Zipf's law, which states a power-law behavior for the frequency distribution of the elements. However, there are empirical cases far from showing a pure power-law behavior, and such a deviation is mostly present for elements with high frequencies. We explain this phenomenon by means of a suitable "damping" effect in the probability of a repetition of an old element. We introduce an extremely general model, whose key element is the update function, which can be suitably chosen in order to reproduce the behavior exhibited by the empirical data. In particular, we make the update function explicit for some Twitter data sets and show strong performance with respect to Heaps' law and, above all, with respect to the fitting of the frequency-rank plots for low and high frequencies. Moreover, we also give other examples of update functions that are able to reproduce the behaviors empirically observed in other contexts.
Affiliation(s)
- Giacomo Aletti
- ADAMSS Center, Università degli Studi di Milano, Milan, Italy.
15
Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021; 12:655990. [PMID: 34305827 PMCID: PMC8292947 DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] Open
Abstract
Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including the Zipf's law, a special case of the scale-free distribution, the Heaps' law describing sublinear growth typical of economies of scales, and the Menzerath-Altmann's law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.
Affiliation(s)
- Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, and C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, United States
|
16
|
Meylan SC, Griffiths TL. The Challenges of Large-Scale, Web-Based Language Datasets: Word Length and Predictability Revisited. Cogn Sci 2021; 45:e12983. [PMID: 34170030 DOI: 10.1111/cogs.12983] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/02/2019] [Revised: 03/16/2021] [Accepted: 04/07/2021] [Indexed: 11/28/2022]
Abstract
Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting "Word lengths are optimized for efficient communication" (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than their frequency. Using what we argue to be best practices for large-scale corpus analyses, we find significantly attenuated support for this result and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.
Affiliation(s)
- Stephan C Meylan
- Department of Brain and Cognitive Science, Massachusetts Institute of Technology; Department of Psychology and Neuroscience, Duke University
|
17
|
Vera J, Urbina F, Palma W. Formation of vocabularies in a decentralized graph-based approach to human language. Phys Rev E 2021; 103:022129. [PMID: 33736099 DOI: 10.1103/physreve.103.022129] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/19/2020] [Accepted: 01/26/2021] [Indexed: 11/07/2022]
Abstract
Zipf's law establishes a scaling behavior for word frequencies in large text corpora. The appearance of Zipfian properties in vocabularies (viewed as an intermediate phase between referentially useless one-word systems and one-to-one word-meaning vocabularies) has been previously explained as an optimization problem for the interests of speakers and hearers. Remarkably, humanlike vocabularies can also be viewed as bipartite graphs. Thus, the aim here is twofold: within a bipartite-graph approach to human vocabularies, to propose a decentralized language game model for the formation of Zipfian properties. To do this, we define a language game in which a population of artificial agents is involved in idealized linguistic interactions. Numerical simulations show the appearance of a drastic transition from an initially disordered state towards three kinds of vocabularies. Our results open ways to study Zipfian properties in language, reconciling models seeing communication as a global minimum of information entropic energies and models focused on self-organization.
Affiliation(s)
- Javier Vera
- Pontificia Universidad Católica de Valparaíso, Valparaíso 2340025, Chile
| | - Felipe Urbina
- Centro de Investigación DAiTA Lab Facultad de Estudios Interdisciplinarios, Universidad Mayor, Santiago 7560913, Chile
| | - Wenceslao Palma
- Escuela de Ingeniería Informática Pontificia, Universidad Católica de Valparaíso, Valparaíso 2362807, Chile
|
18
|
Corral Á, Serra I, Ferrer-I-Cancho R. Distinct flavors of Zipf's law and its maximum likelihood fitting: Rank-size and size-distribution representations. Phys Rev E 2020; 102:052113. [PMID: 33327144 DOI: 10.1103/physreve.102.052113] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/11/2019] [Accepted: 10/18/2020] [Indexed: 11/07/2022]
Abstract
In recent years, researchers have realized the difficulties of fitting power-law distributions properly. These difficulties are higher in Zipfian systems, due to the discreteness of the variables and to the existence of two representations for these systems, i.e., two versions depending on the random variable to fit: rank or size. The discreteness implies that a power law in one of the representations is not a power law in the other, and vice versa. We generate synthetic power laws in both representations and apply a state-of-the-art fitting method to each of the two random variables. The method (based on maximum likelihood plus a goodness-of-fit test) does not fit the whole distribution but the tail, understood as the part of a distribution above a cutoff that separates non-power-law behavior from power-law behavior. We find that, no matter which random variable is power-law distributed, using the rank as the random variable is problematic for fitting, in general (although it may work in some limit cases). One of the difficulties comes from recovering the "hidden" true ranks from the empirical ranks. On the contrary, the representation in terms of the distribution of sizes allows one to recover the true exponent (with some small bias when the underlying size distribution is a power law only asymptotically).
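The two representations contrasted in this abstract can be made concrete with a small sketch. It only builds the rank-size and size-distribution views of a token list. The fitting step (maximum likelihood on the tail plus a goodness-of-fit test) is not reproduced, and the function names are illustrative rather than taken from the article.

```python
from collections import Counter


def rank_size(tokens):
    """Rank-size view: type frequencies in decreasing order.

    Position i (0-based) holds the frequency of the type with empirical rank i + 1.
    """
    return sorted(Counter(tokens).values(), reverse=True)


def size_distribution(tokens):
    """Size-distribution view: maps a frequency n to the number of
    distinct types that occur exactly n times."""
    type_counts = Counter(tokens)
    return dict(sorted(Counter(type_counts.values()).items()))


tokens = "a a a a b b c c d e".split()
ranks = rank_size(tokens)          # frequency as a function of rank
sizes = size_distribution(tokens)  # number of types as a function of frequency
print(ranks)
print(sizes)
```

In the first view the random variable is the rank, in the second it is the size (frequency). The abstract's point is that a power law in one view is not, in general, a power law in the other.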
Affiliation(s)
- Álvaro Corral
- Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain; Departament de Matemàtiques, Facultat de Ciències, Universitat Autònoma de Barcelona, E-08193 Barcelona, Spain; Barcelona Graduate School of Mathematics, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain; Complexity Science Hub Vienna, Josefstädter Strasse 39, 1080 Vienna, Austria
| | - Isabel Serra
- Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain; Computer Architecture and Operating Systems Group, Barcelona Supercomputing Center (BSC-CNS), E-08034 Barcelona, Spain
| | - Ramon Ferrer-I-Cancho
- Complexity and Quantitative Linguistics Lab, Departament de Ciències de la Computació, Universitat Politècnica de Catalunya, E-08034 Barcelona, Catalonia, Spain
|
19
|
Lennox RJ, Veríssimo D, Twardek WM, Davis CR, Jarić I. Sentiment analysis as a measure of conservation culture in scientific literature. Conserv Biol 2020; 34:462-471. [PMID: 31379018 DOI: 10.1111/cobi.13404] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/28/2019] [Revised: 07/10/2019] [Accepted: 07/31/2019] [Indexed: 06/10/2023]
Abstract
Culturomics is emerging as an important field within science, as a way to measure attitudes and beliefs and their dynamics across time and space via quantitative analysis of digitized data from literature, news, film, social media, and more. Sentiment analysis is a culturomics tool that, within the last decade, has provided a means to quantify the polarity of attitudes expressed within various media. Conservation science is a crisis discipline; therefore, accurate and effective communication is paramount. We investigated how conservation scientists communicate their findings through scientific journal articles. We analyzed 15,001 abstracts from articles published from 1998 to 2017 in 6 conservation-focused journals selected based on indexing in scientific databases. Articles were categorized by year, focal taxa, and the conservation status of the focal species. We calculated mean sentiment score for each abstract (mean adjusted z score) based on 4 lexicons (Jockers-Rinker, National Research Council, Bing, and AFINN). We found a significant positive annual trend in the sentiment scores of articles. We also observed a significant trend toward increasing negativity along the spectrum of conservation status categories (i.e., from least concern to extinct). There were some clear differences in the sentiments with which research on different taxa was reported, however. For example, abstracts mentioning lobe-finned fishes tended to have high sentiment scores, which could be related to the rediscovery of the coelacanth driving a positive narrative. In contrast, abstracts mentioning elasmobranchs had low scores, possibly reflecting the negative sentiment score associated with the word shark. Sentiment analysis has applications in science, especially as it pertains to conservation psychology, and we suggest a new science-based lexicon be developed specifically for the field of conservation.
Affiliation(s)
- Robert J Lennox
- NORCE Norwegian Research Centre, Laboratory for Freshwater Ecology and Inland Fisheries, Nygårdsgaten 112, Bergen, 5008, Norway
| | - Diogo Veríssimo
- Department of Zoology, University of Oxford, 11a Mansfield Road, Oxford, OX1 3SZ, U.K
- Oxford Martin School, University of Oxford, 34 Broad Street, Oxford, OX1 3BD, U.K
- Institute for Conservation Research, San Diego Zoo Global, 15600 San Pasqual Valley Road, Escondido, CA, 92027, U.S.A
| | - William M Twardek
- Fish Ecology and Conservation Physiology Laboratory, Carleton University, Ottawa, ON, K1S 5B6, Canada
| | - Colin R Davis
- Insilicor Analytics, 98 Caroline Avenue, Ottawa, ON, K1Y 0S9, Canada
| | - Ivan Jarić
- Biology Centre of the Czech Academy of Sciences, Institute of Hydrobiology, Na Sádkách 702/7, 37005, České Budějovice, Czech Republic
- Faculty of Science, Department of Ecosystem Biology, University of South Bohemia, Branišovská 31a, 37005, České Budějovice, Czech Republic
|
20
|
Abstract
Beauty is subjective, and as such it, of course, cannot be defined in absolute terms. But we all know or feel when something is beautiful to us personally. And in such instances, methods of statistical physics and network science can be used to quantify and to better understand what it is that evokes that pleasant feeling, be it when reading a book or looking at a painting. Indeed, recent large-scale explorations of digital data have lifted the veil on many aspects of our artistic expressions that would remain forever hidden in smaller samples. From the determination of complexity and entropy of art paintings to the creation of the flavour network and the principles of food pairing, fascinating research at the interface of art, physics and network science abounds. We here review the existing literature, focusing in particular on culinary, visual, musical and literary arts. We also touch upon cultural history and culturomics, as well as on the connections between physics and the social sciences in general. The review shows that the synergies between these fields yield highly entertaining results that can often be enjoyed by laymen and experts alike. In addition to its wider appeal, the reviewed research also has many applications, ranging from improved recommendation to the detection of plagiarism.
Affiliation(s)
- Matjaž Perc
- Faculty of Natural Sciences and Mathematics, University of Maribor, Koroška cesta 160, 2000 Maribor, Slovenia; Department of Medical Research, China Medical University Hospital, China Medical University, Taichung, Taiwan; Complexity Science Hub Vienna, Josefstädterstraße 39, 1080 Vienna, Austria
|
21
|
Chacoma A, Zanette DH. Heaps' Law and Heaps functions in tagged texts: evidences of their linguistic relevance. R Soc Open Sci 2020; 7:200008. [PMID: 32269820 PMCID: PMC7137977 DOI: 10.1098/rsos.200008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2020] [Accepted: 02/21/2020] [Indexed: 06/11/2023]
Abstract
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags,' namely, nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
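The comparison described in this abstract, new-word appearance along a text versus the average over random shufflings, can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, no grammatical tagging, a toy sentence as input, and hypothetical function names. It is not the authors' procedure.

```python
import random


def heaps_curve(tokens):
    """V(n): number of distinct tokens among the first n tokens, n = 1..len(tokens)."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def shuffled_mean_curve(tokens, n_shuffles=200, seed=0):
    """Average Heaps curve over random shufflings of the same tokens."""
    rng = random.Random(seed)
    work = list(tokens)
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_curve(work)):
            totals[i] += v
    return [t / n_shuffles for t in totals]


text = "the cat sat on the mat and the dog sat on the log".split()
observed = heaps_curve(text)
baseline = shuffled_mean_curve(text)
# Negative values mean new words appear later than in the shuffled
# baseline, which the abstract calls "retarded".
deviation = [o - b for o, b in zip(observed, baseline)]
print(observed)
```

Both curves necessarily meet at the last position, because shuffling does not change the total vocabulary. Only the path towards that endpoint carries information about word order.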
Affiliation(s)
- A. Chacoma
- Instituto de Física Enrique Gaviola, Consejo Nacional de Investigaciones Científicas y Técnicas and Universidad Nacional de Córdoba, Ciudad Universitaria, 5000 Córdoba, Pcia. de Córdoba, Argentina
| | - D. H. Zanette
- Centro Atómico Bariloche and Instituto Balseiro, Comisión Nacional de Energía Atómica and Universidad Nacional de Cuyo, Consejo Nacional de Investigaciones Científicas y Técnicas, Av. Bustillo 9500, 8400 San Carlos de Bariloche, Pcia. de Río Negro, Argentina
|
22
|
A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics. Entropy 2020; 22:e22010126. [PMID: 33285901 PMCID: PMC7516435 DOI: 10.3390/e22010126] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 11/29/2019] [Revised: 01/15/2020] [Accepted: 01/16/2020] [Indexed: 11/16/2022]
Abstract
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
|
23
|
Beyer R, Singarayer JS, Stock JT, Manica A. Environmental conditions do not predict diversification rates in the Bantu languages. Heliyon 2019; 5:e02630. [PMID: 31692645 PMCID: PMC6806388 DOI: 10.1016/j.heliyon.2019.e02630] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/23/2019] [Revised: 09/26/2019] [Accepted: 10/08/2019] [Indexed: 11/30/2022] Open
Abstract
The global distribution of language diversity mirrors that of several variables related to ecosystem productivity. It has been argued that this is driven by the size of social networks, which tend to be larger in harsher climates to ensure food security, leading to reduced language divergence. Is this pattern purely synchronic, or is there also a quantifiable relationship between environmental conditions and language diversification over time? We used a spatio-temporal phylogeny of the Bantu language family to estimate local diversification rates at the times and locations of language divergence. We compared these data against spatially-explicit reconstructions of several palaeoclimate and palaeovegetation variables (mean annual temperature and the temperature of the coldest and warmest quarter, annual precipitation and the precipitation of the wettest and driest quarter, growing degree days, the length of the growing season, and net primary production), to investigate a potential link between local environmental factors and diversification rates in the Bantu languages. A regression analysis does not suggest a statistically significant relationship between climatic or ecological variables and linguistic diversification over time. We find a strong positive correlation between pairwise linguistic and geographic distances in the Bantu languages, arguing for a dominant role of isolation as a result of the rapid Bantu expansion that might have overwhelmed any potential influence of local environmental factors.
Affiliation(s)
- Robert Beyer
- Department of Zoology, University of Cambridge, Cambridge, CB2 3EJ, United Kingdom
- PAVE Research Group, Department of Archaeology, University of Cambridge, Cambridge, CB2 3DZ, United Kingdom
| | - Joy S. Singarayer
- Department of Meteorology and Centre for Past Climate Change, University of Reading, Whiteknights campus, PO Box 243, Reading, RG6 6BB, United Kingdom
| | - Jay T. Stock
- PAVE Research Group, Department of Archaeology, University of Cambridge, Cambridge, CB2 3DZ, United Kingdom
- Department of Anthropology, Western University, London, Ontario, N6A 5C2, Canada
- Department of Archaeology, Max Planck Institute for the Science of Human History, Kahlaische Strasse 10, D-07745 Jena, Germany
| | - Andrea Manica
- Department of Zoology, University of Cambridge, Cambridge, CB2 3EJ, United Kingdom
|
24
|
Bokányi E, Kondor D, Vattay G. Scaling in words on Twitter. R Soc Open Sci 2019; 6:190027. [PMID: 31824682 PMCID: PMC6837183 DOI: 10.1098/rsos.190027] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/07/2019] [Accepted: 09/08/2019] [Indexed: 05/28/2023]
Abstract
Scaling properties of language are a useful tool for understanding generative processes in texts. We investigate the scaling relations in citywise Twitter corpora coming from the metropolitan and micropolitan statistical areas of the United States. We observe a slightly superlinear urban scaling with the city population for the total volume of the tweets and words created in a city. We then find that a certain core vocabulary follows the same scaling relationship as the bulk text, but most words are sensitive to city size, exhibiting a super- or a sublinear urban scaling. For both regimes, we can offer a plausible explanation based on the meaning of the words. We also show that the parameters for Zipf's Law and Heaps' Law differ on Twitter from those of other texts, and that the exponent of Zipf's Law changes with city size.
Affiliation(s)
| | - Dániel Kondor
- Senseable City Laboratory, MIT, Cambridge, MA 02139, USA
- Singapore-MIT Alliance for Research and Technology, Singapore 138602, Republic of Singapore
|
25
|
Burridge J, Vaux B, Gnacik M, Grudeva Y. Statistical physics of language maps in the USA. Phys Rev E 2019; 99:032305. [PMID: 30999445 DOI: 10.1103/physreve.99.032305] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/22/2018] [Indexed: 11/07/2022]
Abstract
Spatial linguistic surveys often reveal well-defined geographical zones where certain linguistic forms are dominant over their alternatives. It has been suggested that these patterns may be understood by analogy with coarsening in models of two-dimensional physical systems. Here we investigate this connection by comparing data from the Cambridge Online Survey of World Englishes to the behavior of a generalized zero temperature Potts model with long-range interactions. The relative displacements of linguistically similar population centers reveal enhanced east-west affinity. Cluster analysis reveals three distinct linguistic zones. We find that when the interaction kernel is made anisotropic by stretching along the east-west axis, the model can reproduce the three linguistic zones for all interaction parameters tested. The model results are consistent with a view held by some linguists that, in the USA, language use is, or has been, exchanged or transmitted to a greater extent along the east-west axis than the north-south.
Affiliation(s)
- J Burridge
- School of Mathematics and Physics, University of Portsmouth, Portsmouth PO1 3HF, United Kingdom
| | - B Vaux
- Faculty of Modern and Medieval Languages, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - M Gnacik
- School of Mathematics and Physics, University of Portsmouth, Portsmouth PO1 3HF, United Kingdom
| | - Y Grudeva
- School of Mathematics and Physics, University of Portsmouth, Portsmouth PO1 3HF, United Kingdom
|
26
|
Troumbis AY, Hatziantoniou M, Vasios GK. Nutritional Culturomics and Big Data: Macroscopic Patterns of Change in Food, Nutrition and Diet Choices. Curr Pharm Biotechnol 2019; 20:895-908. [PMID: 30747060 DOI: 10.2174/1389201020666190211125550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2018] [Revised: 09/11/2018] [Accepted: 12/10/2018] [Indexed: 11/22/2022]
Abstract
BACKGROUND & OBJECTIVE: Nutritional culturomics (NCs) is a specific focus area of culturomics epistemology developing digital humanities and computational linguistics approaches to search for macro-patterns of public interest in food, nutrition and diet choice as a major component of cultural evolution. Cultural evolution is considered as a driver at the interface of environmental and food science, economy and policy. METHODS: The paper presents an epistemic programme that builds on the use of big data from web-based services such as Google Trends, Google Adwords or Google Books Ngram Viewer. RESULTS: A comparison of clearly defined NCs in terms of geography, culture, linguistics, literacy, technological setups or time period might be used to reveal variations and singularities in the public's behavior in terms of adaptation and mitigation policies in the agri-food and public health sectors. CONCLUSION: The proposed NC programme is developed along major axes: (1) the definition of an NC; (2) the reconstruction of food and diet histories; (3) the nutrition-related epidemiology; (4) the understanding of variability of NCs; (5) the methodological diversification of NCs; (6) the quantifiable limitations and flaws of NCs. A series of indicative examples are presented regarding these NC epistemology components.
Affiliation(s)
- Andreas Y Troumbis
- Biodiversity Conservation Laboratory, Department of Environmental Studies, University of the Aegean, Greece
| | - Maria Hatziantoniou
- Section of Environmental Social Sciences, Department of Environmental Studies, University of the Aegean, Greece
| | - Georgios K Vasios
- Department of Food Science and Nutrition; School of the Environment, University of the Aegean, Greece
|
27
|
|
28
|
The natural selection of words: Finding the features of fitness. PLoS One 2019; 14:e0211512. [PMID: 30689665 PMCID: PMC6349325 DOI: 10.1371/journal.pone.0211512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/31/2018] [Accepted: 01/15/2019] [Indexed: 11/20/2022] Open
Abstract
We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word’s length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.
|
29
|
Mapping the Americanization of English in space and time. PLoS One 2018; 13:e0197741. [PMID: 29799872 PMCID: PMC5969760 DOI: 10.1371/journal.pone.0197741] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 01/01/2018] [Accepted: 05/08/2018] [Indexed: 11/19/2022] Open
Abstract
As global political preeminence gradually shifted from the United Kingdom to the United States, so did the capacity to culturally influence the rest of the world. In this work, we analyze how the world-wide varieties of written English are evolving. We study both the spatial and temporal variations of vocabulary and spelling of English using a large corpus of geolocated tweets and the Google Books datasets corresponding to books published in the US and the UK. The advantage of our approach is that we can address both standard written language (Google Books) and the more colloquial forms of microblogging messages (Twitter). We find that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders. Finally, we analyze how this trend has evolved over time and the impact that some cultural events have had in shaping it.
|
30
|
Raducha T, Gubiec T. Predicting language diversity with complex networks. PLoS One 2018; 13:e0196593. [PMID: 29702699 PMCID: PMC5922521 DOI: 10.1371/journal.pone.0196593] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/07/2018] [Accepted: 04/16/2018] [Indexed: 11/18/2022] Open
Abstract
We analyze a model of social interactions with coevolution of the topology and states of the nodes. This model can be interpreted as a model of language change. We propose different rewiring mechanisms and perform numerical simulations for each. The results obtained are compared with empirical data gathered from two online databases and an anthropological study of the Solomon Islands. We study the behavior of the number of languages for different system sizes and we find that only local rewiring, i.e. triadic closure, is capable of reproducing results for the empirical data in a qualitative manner. Furthermore, we resolve the contradiction between previous models and the Solomon Islands case. Our results demonstrate the importance of the topology of the network, and the rewiring mechanism in the process of language change.
Affiliation(s)
- Tomasz Raducha
- Institute of Experimental Physics, Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland
- IFISC (CSIC-UIB), Instituto de Física Interdisciplinar y Sistemas Complejos, Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain
| | - Tomasz Gubiec
- Institute of Experimental Physics, Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland
- Center for Polymer Studies, Boston University, Boston, MA 02215 United States of America
|
31
|
Ashraf MI, Sinha S. The "handedness" of language: Directional symmetry breaking of sign usage in words. PLoS One 2018; 13:e0190735. [PMID: 29342176 PMCID: PMC5771592 DOI: 10.1371/journal.pone.0190735] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/08/2016] [Accepted: 12/15/2017] [Indexed: 11/25/2022] Open
Abstract
Language, which allows complex ideas to be communicated through symbolic sequences, is a characteristic feature of our species and manifested in a multitude of forms. Using large written corpora for many different languages and scripts, we show that the occurrence probability distributions of signs at the left and right ends of words have a distinct heterogeneous nature. Characterizing this asymmetry using quantitative inequality measures, viz. information entropy and the Gini index, we show that the beginning of a word is less restrictive in sign usage than the end. This property is not simply attributable to the use of common affixes as it is seen even when only word roots are considered. We use the existence of this asymmetry to infer the direction of writing in undeciphered inscriptions that agrees with the archaeological evidence. Unlike traditional investigations of phonotactic constraints which focus on language-specific patterns, our study reveals a property valid across languages and writing systems. As both language and writing are unique aspects of our species, this universal signature may reflect an innate feature of the human cognitive phenomenon.
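The two inequality measures named in this abstract can be sketched for word-initial and word-final signs as below. This is a minimal illustration, assuming each character is one sign, that the sign inventory is the set of characters seen in the word list, and a four-word toy corpus. The function names are hypothetical and the article's exact estimator may differ.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini_index(counts):
    """Gini index of raw counts: 0 for perfectly even usage,
    (n - 1) / n when a single one of n signs takes all occurrences."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


def edge_sign_stats(words):
    """Entropy and Gini index of sign usage at the start and end of words."""
    words = [w for w in words if w]
    inventory = sorted(set("".join(words)))  # assumed sign inventory
    first = Counter(w[0] for w in words)
    last = Counter(w[-1] for w in words)
    stats = {}
    for label, counter in (("start", first), ("end", last)):
        counts = [counter.get(sign, 0) for sign in inventory]
        stats[label] = {"entropy": shannon_entropy(counts), "gini": gini_index(counts)}
    return stats


stats = edge_sign_stats(["sing", "song", "sang", "ring"])
print(stats)
```

In this toy list every word ends in the same sign, so the final position has zero entropy and a higher Gini index than the initial position, which is the direction of asymmetry the abstract reports for real corpora.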
Affiliation(s)
- Md. Izhar Ashraf
- The Institute of Mathematical Sciences, Chennai, Tamil Nadu, India
- B. S. Abdur Rahman University, Chennai, Tamil Nadu, India
| | - Sitabhra Sinha
- The Institute of Mathematical Sciences, Chennai, Tamil Nadu, India
- National Institute of Advanced Studies, Bengaluru, Karnataka, India
|
32
|
Burridge J. Unifying models of dialect spread and extinction using surface tension dynamics. R Soc Open Sci 2018; 5:171446. [PMID: 29410847 PMCID: PMC5792924 DOI: 10.1098/rsos.171446] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/26/2017] [Accepted: 11/23/2017] [Indexed: 06/01/2023]
Abstract
We provide a unified mathematical explanation of two classical forms of spatial linguistic spread. The wave model describes the radiation of linguistic change outwards from a central focus. Changes can also jump between population centres in a process known as hierarchical diffusion. It has recently been proposed that the spatial evolution of dialects can be understood using surface tension at linguistic boundaries. Here we show that the inclusion of long-range interactions in the surface tension model generates both wave-like spread, and hierarchical diffusion, and that it is surface tension that is the dominant effect in deciding the stable distribution of dialect patterns. We generalize the model to allow population mixing which can induce shrinkage of linguistic domains, or destroy dialect regions from within.
|
33
|
Westgate MJ, Lindenmayer DB. The difficulties of systematic reviews. CONSERVATION BIOLOGY: THE JOURNAL OF THE SOCIETY FOR CONSERVATION BIOLOGY 2017; 31:1002-1007. [PMID: 28042667 DOI: 10.1111/cobi.12890] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/16/2016] [Revised: 11/07/2016] [Accepted: 12/19/2016] [Indexed: 05/05/2023]
Abstract
The need for robust evidence to support conservation actions has driven the adoption of systematic approaches to research synthesis in ecology. However, applying systematic review to complex or open questions remains challenging, and this task is becoming more difficult as the quantity of scientific literature increases. We drew on the science of linguistics for guidance as to why the process of identifying and sorting information during systematic review remains so labor intensive, and to provide potential solutions. Several linguistic properties of peer-reviewed corpora (including nonrandom selection of review topics, small-world properties of semantic networks, and spatiotemporal variation in word meaning) greatly increase the effort needed to complete the systematic review process. Conversely, the resolution of these semantic complexities is a common motivation for narrative reviews, but this process is rarely enacted with the rigor applied during linguistic analysis. Therefore, linguistics provides a unifying framework for understanding some key challenges of systematic review and highlights 2 useful directions for future research. First, in cases where semantic complexity generates barriers to synthesis, ecologists should consider drawing on existing methods (such as natural language processing or the construction of research thesauri and ontologies) that provide tools for mapping and resolving that complexity. These tools could help individual researchers classify research material in a more robust manner and provide valuable guidance for future researchers on that topic. Second, a linguistic perspective highlights that scientific writing is a rich resource worthy of detailed study, an observation that can sometimes be lost during the search for data during systematic review or meta-analysis. For example, mapping semantic networks can reveal redundancy and complementarity among scientific concepts, leading to new insights and research questions. Consequently, wider adoption of linguistic approaches may facilitate improved rigor and richness in research synthesis.
Affiliation(s)
- Martin J Westgate
- Fenner School of Environment and Society, The Australian National University, Canberra, ACT, 2601, Australia
| | - David B Lindenmayer
- Fenner School of Environment and Society, The Australian National University, Canberra, ACT, 2601, Australia
- ARC Centre of Excellence for Environmental Decisions, The Australian National University, Canberra, ACT, 2601, Australia
|
34
|
Nasir A, Kim KM, Caetano-Anollés G. Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells. Front Microbiol 2017; 8:1178. [PMID: 28690608 PMCID: PMC5481351 DOI: 10.3389/fmicb.2017.01178] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 03/27/2017] [Accepted: 06/09/2017] [Indexed: 01/05/2023] Open
Abstract
Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. 
Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity that is a tenet of evolutionary thinking.
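The Heaps law mentioned in this abstract states that vocabulary size grows sublinearly with collection size, V(N) ≈ K·N^β with β < 1. The sketch below is a generic illustration, not the authors' phylogenetic tracing: it builds a vocabulary growth curve from any token sequence (words, or domain identifiers if a proteome is treated as a "text") and estimates K and β by least squares in log-log space. Both function names are hypothetical.

```python
import math

def vocabulary_growth(tokens):
    """Return (N, V) pairs: distinct tokens V seen after the first N tokens."""
    seen, curve = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - beta * mx)
    return k, beta
```

A single log-log regression assumes one scaling regime. Detecting several regimes, as the abstract reports, would require fitting the curve piecewise.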
Affiliation(s)
- Arshan Nasir
- Department of Biosciences, COMSATS Institute of Information Technology, Islamabad, Pakistan
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Kyung Mo Kim
- Division of Polar Life Sciences, Korea Polar Research Institute, Incheon, South Korea
| | - Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, United States
|
35
|
Dodds PS, Dewhurst DR, Hazlehurst FF, Van Oort CM, Mitchell L, Reagan AJ, Williams JR, Danforth CM. Simon's fundamental rich-get-richer model entails a dominant first-mover advantage. Phys Rev E 2017; 95:052301. [PMID: 28618612 DOI: 10.1103/physreve.95.052301] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 08/16/2016] [Indexed: 11/07/2022]
Abstract
Herbert Simon's classic rich-get-richer model is one of the simplest empirically supported mechanisms capable of generating heavy-tail size distributions for complex systems. Simon argued analytically that a population of flavored elements growing by either adding a novel element or randomly replicating an existing one would afford a distribution of group sizes with a power-law tail. Here, we show that, in fact, Simon's model does not produce a simple power-law size distribution as the initial element has a dominant first-mover advantage, and will be overrepresented by a factor proportional to the inverse of the innovation probability. The first group's size discrepancy cannot be explained away as a transient of the model, and may therefore be many orders of magnitude greater than expected. We demonstrate how Simon's analysis was correct but incomplete, and expand our alternate analysis to quantify the variability of long term rankings for all groups. We find that the expected time for a first replication is infinite, and show how an incipient group must break the mechanism to improve their odds of success. We present an example of citation counts for a specific field that demonstrates a first-mover advantage consistent with our revised view of the rich-get-richer mechanism. Our findings call for a reexamination of preceding work invoking Simon's model and provide an expanded understanding going forward.
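The growth rule described in this abstract can be sketched in a few lines: at each step, with innovation probability `rho` a new group of size one is created, and otherwise a uniformly chosen existing element is replicated, which selects a group in proportion to its size. The function name, the seeding, and the single-element starting state are illustrative assumptions.

```python
import random

def simon_model(steps, rho, seed=0):
    """Simulate Simon's model; returns group sizes in order of arrival."""
    rng = random.Random(seed)
    elements = [0]  # group label of every element; group 0 is the first mover
    sizes = [1]
    for _ in range(steps):
        if rng.random() < rho:
            # Innovation: a novel group enters with a single element.
            sizes.append(1)
            elements.append(len(sizes) - 1)
        else:
            # Replication: copying a random element favors large groups.
            g = rng.choice(elements)
            sizes[g] += 1
            elements.append(g)
    return sizes
```

Comparing `sizes[0]` with later groups across many seeds and small values of `rho` is one way to explore the first-mover effect the abstract describes. A single run is only suggestive.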
Affiliation(s)
- Peter Sheridan Dodds
- Vermont Complex Systems Center, Computational Story Lab, Vermont Advanced Computing Core, Department of Mathematics & Statistics, University of Vermont, Burlington, Vermont 05401, USA
| | - David Rushing Dewhurst
- Vermont Complex Systems Center, Computational Story Lab, Vermont Advanced Computing Core, Department of Mathematics & Statistics, University of Vermont, Burlington, Vermont 05401, USA
| | - Fletcher F Hazlehurst
- Vermont Complex Systems Center, Computational Story Lab, Vermont Advanced Computing Core, Department of Mathematics & Statistics, University of Vermont, Burlington, Vermont 05401, USA
| | - Colin M Van Oort
- Vermont Complex Systems Center, Computational Story Lab, Vermont Advanced Computing Core, Department of Mathematics & Statistics, University of Vermont, Burlington, Vermont 05401, USA
| | - Lewis Mitchell
- School of Mathematical Sciences, North Terrace Campus, University of Adelaide, South Australia 5005, Australia
| | - Andrew J Reagan
- Vermont Complex Systems Center, Computational Story Lab, Vermont Advanced Computing Core, Department of Mathematics & Statistics, University of Vermont, Burlington, Vermont 05401, USA
| | - Jake Ryland Williams
- Department of Information Science, Drexel University, 3141 Chestnut Street, Philadelphia, Pennsylvania 19104, USA
| | - Christopher M Danforth
- Vermont Complex Systems Center, Computational Story Lab, Vermont Advanced Computing Core, Department of Mathematics & Statistics, University of Vermont, Burlington, Vermont 05401, USA
|
36
|
Lipowska D, Lipowski A. Language competition in a population of migrating agents. Phys Rev E 2017; 95:052308. [PMID: 28618596 DOI: 10.1103/physreve.95.052308] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 02/24/2017] [Indexed: 11/07/2022]
Abstract
Migration influences various aspects of human activity and is also associated with language formation. To examine the mutual interaction of these processes, we study a Naming Game with migrating agents. The dynamics of the model lead to the formation of low-mobility clusters, which turn out to break the symmetry of the model: although the Naming Game remains symmetric, low-mobility languages are favored. High-mobility languages are gradually eliminated from the system, and the dynamics of language formation slow down considerably. Our model is too simple to explain in detail the language competition of migrating human communities, but it certainly shows that languages of settlers are favored over nomadic ones.
Affiliation(s)
- Dorota Lipowska
- Faculty of Modern Languages and Literature, Adam Mickiewicz University, Poznań, Poland
| | - Adam Lipowski
- Faculty of Physics, Adam Mickiewicz University, Poznań, Poland
|
37
|
The Entropy of Words: Learnability and Expressivity across More than 1000 Languages. ENTROPY 2017. [DOI: 10.3390/e19060275] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 11/16/2022]
|
38
|
Bao P, Zhang X. Uncovering and Predicting the Dynamic Process of Collective Attention with Survival Theory. Sci Rep 2017; 7:2621. [PMID: 28572618 PMCID: PMC5453944 DOI: 10.1038/s41598-017-02826-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 01/10/2017] [Accepted: 04/19/2017] [Indexed: 11/16/2022] Open
Abstract
The subject of collective attention is at the center of this era of information explosion. It is thus of great interest to understand the fundamental mechanism underlying attention in large populations within a complex evolving system. Moreover, an ability to predict the dynamic process of collective attention for individual items has important implications in an array of areas. In this report, we propose a generative probabilistic model using a self-excited Hawkes process with survival theory to model and predict the process through which individual items gain attention. This model explicitly captures three key ingredients: the intrinsic attractiveness of an item, characterizing its inherent competitiveness against other items; a reinforcement mechanism based on the sum of all previous attention triggers; and a power-law temporal relaxation function, corresponding to the aging in the ability to attract new attention. Experiments on two population-scale datasets demonstrate that this model consistently outperforms the state-of-the-art methods.
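To make the three ingredients concrete, the sketch below evaluates the intensity of a generic self-exciting (Hawkes-type) process: a constant base term for intrinsic attractiveness, plus one contribution per earlier event, each decaying as a power law of elapsed time. The functional form and every parameter name (`attractiveness`, `alpha`, `theta`, `t0`) are assumptions chosen for illustration. The paper's exact parameterization and its survival-theory component are not reproduced here.

```python
def power_law_kernel(dt, theta, t0=1.0):
    """Power-law relaxation: influence of an event dt time units in the past."""
    return (dt + t0) ** (-(1.0 + theta))

def hawkes_intensity(t, event_times, attractiveness, alpha, theta, t0=1.0):
    """Event rate at time t for an assumed self-exciting process.

    attractiveness: constant base rate of the item (assumed form)
    alpha:          strength of reinforcement from each earlier event
    theta, t0:      shape and offset of the power-law decay
    """
    excitation = sum(power_law_kernel(t - ti, theta, t0)
                     for ti in event_times if ti < t)
    return attractiveness + alpha * excitation
```

With this form, a recent burst of events raises the intensity sharply and the boost fades slowly, which is the qualitative "aging" behavior the abstract refers to.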
Affiliation(s)
- Peng Bao
- School of Software Engineering, Beijing Jiaotong University, Beijing, China.
| | - Xiaoxia Zhang
- School of Economics and Management, Tsinghua University, Beijing, China
|
39
|
Brysbaert M, Stevens M, Mandera P, Keuleers E. How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant's Age. Front Psychol 2016; 7:1116. [PMID: 27524974 PMCID: PMC4965448 DOI: 10.3389/fpsyg.2016.01116] [Citation(s) in RCA: 119] [Impact Index Per Article: 13.2] [Received: 02/10/2016] [Accepted: 07/12/2016] [Indexed: 11/13/2022] Open
Abstract
Based on an analysis of the literature and a large scale crowdsourcing experiment, we estimate that an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas or about one new lemma every 2 days. The knowledge of the words can be as shallow as knowing that the word exists. In addition, people learn tens of thousands of inflected forms and proper nouns (names), which account for the substantially high numbers of ‘words known’ mentioned in other publications.
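The learning rate quoted in this abstract follows from simple arithmetic, reproduced below as a check. The helper name and the 365.25-day year are assumptions made for the calculation.

```python
def days_per_new_lemma(extra_lemmas, start_age, end_age, days_per_year=365.25):
    """Average number of days between newly learned lemmas over an age span."""
    return (end_age - start_age) * days_per_year / extra_lemmas

# 6,000 extra lemmas between ages 20 and 60, as stated in the abstract.
rate = days_per_new_lemma(6_000, 20, 60)
print(f"{rate:.1f} days per lemma")
```

The result is a little over two days per lemma, consistent with the abstract's "about one new lemma every 2 days".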
Affiliation(s)
- Marc Brysbaert
- Department of Experimental Psychology, Ghent University, Ghent, Belgium
| | - Michaël Stevens
- Department of Experimental Psychology, Ghent University, Ghent, Belgium
| | - Paweł Mandera
- Department of Experimental Psychology, Ghent University, Ghent, Belgium
| | - Emmanuel Keuleers
- Department of Experimental Psychology, Ghent University, Ghent, Belgium
|
40
|
Yun J, Shang SC, Wei XD, Liu S, Li ZJ. The possibility of coexistence and co-development in language competition: ecology-society computational model and simulation. SPRINGERPLUS 2016; 5:855. [PMID: 27386304 PMCID: PMC4919202 DOI: 10.1186/s40064-016-2482-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/30/2016] [Accepted: 05/31/2016] [Indexed: 11/10/2022]
Abstract
Language is characterized by both ecological properties and social properties, and competition is the basic form of language evolution. The rise and decline of one language is a result of competition between languages. Moreover, this rise and decline directly influences the diversity of human culture. Mathematics and computer modeling for language competition has been a popular topic in the fields of linguistics, mathematics, computer science, ecology, and other disciplines. Currently, there are several problems in the research on language competition modeling. First, comprehensive mathematical analysis is absent in most studies of language competition models. Next, most language competition models are based on the assumption that one language in the model is stronger than the other. These studies tend to ignore cases where there is a balance of power in the competition. The competition between two well-matched languages is more practical, because it can facilitate the co-development of two languages. A third issue with current studies is that many studies have an evolution result where the weaker language inevitably goes extinct. From the integrated point of view of ecology and sociology, this paper improves the Lotka–Volterra model and basic reaction–diffusion model to propose an “ecology–society” computational model for describing language competition. Furthermore, a strict and comprehensive mathematical analysis was made for the stability of the equilibria. Two languages in competition may be either well-matched or greatly different in strength, which was reflected in the experimental design. The results revealed that language coexistence, and even co-development, are likely to occur during language competition.
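For orientation, the sketch below integrates the classic two-population Lotka-Volterra competition equations with a forward Euler scheme. It is the textbook starting point, not the improved "ecology-society" model proposed in the paper, and it omits the reaction-diffusion (spatial) part. All parameter names are assumptions: `x` and `y` are speaker populations, `k1` and `k2` carrying capacities, `a12` and `a21` competition coefficients.

```python
def simulate_competition(x0, y0, r1, r2, k1, k2, a12, a21,
                         dt=0.01, steps=20_000):
    """Euler integration of classic Lotka-Volterra competition.

    dx/dt = r1 * x * (1 - (x + a12 * y) / k1)
    dy/dt = r2 * y * (1 - (y + a21 * x) / k2)
    Returns the final (x, y) after `steps` steps of size `dt`.
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = r1 * x * (1 - (x + a12 * y) / k1)
        dy = r2 * y * (1 - (y + a21 * x) / k2)
        x, y = x + dt * dx, y + dt * dy
    return x, y

# Weak mutual competition: both languages persist.
print(simulate_competition(0.1, 0.3, 1, 1, 1, 1, a12=0.5, a21=0.5))
# Strong mutual competition: the initially larger language takes over.
print(simulate_competition(0.3, 0.1, 1, 1, 1, 1, a12=1.5, a21=1.5))
```

In this classic model, stable coexistence requires `a12 < k1 / k2` and `a21 < k2 / k1`, which is why the two example runs end so differently.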
Affiliation(s)
- Jian Yun
- School of Computer Science and Engineering, Dalian Nationalities University, Dalian, 116600 Liaoning China
| | - Song-Chao Shang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 610054 Sichuan China
| | - Xiao-Dan Wei
- School of Computer Science and Engineering, Dalian Nationalities University, Dalian, 116600 Liaoning China
| | - Shuang Liu
- School of Computer Science and Engineering, Dalian Nationalities University, Dalian, 116600 Liaoning China
| | - Zhi-Jie Li
- School of Computer Science and Engineering, Dalian Nationalities University, Dalian, 116600 Liaoning China
|
41
|
Gherardi M, Bassetti F, Cosentino Lagomarsino M. Law of corresponding states for open collaborations. Phys Rev E 2016; 93:042307. [PMID: 27176312 DOI: 10.1103/physreve.93.042307] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/18/2014] [Indexed: 11/07/2022]
Abstract
We study the relation between number of contributors and product size in Wikipedia and GitHub. In contrast to traditional production, this is strongly probabilistic, but is characterized by two quantitative nonlinear laws: a power-law bound to product size for increasing number of contributors, and the universal collapse of rescaled distributions. A variant of the random-energy model shows that both laws are due to the heterogeneity of contributors, and displays an intriguing finite-size scaling property with no equivalent in standard systems. The analysis uncovers the right intensive densities, enabling the comparison of projects with different numbers of contributors on equal grounds. We use this property to expose the detrimental effects of conflicting interactions in Wikipedia.
Affiliation(s)
- Marco Gherardi
- Sorbonne Universités, UPMC Univ Paris 06, UMR 7238, Computational and Quantitative Biology, 15 rue de l'École de Médecine, Paris, France
- Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, 20133 Milano, Italy
- I.N.F.N. Milano
| | | | - Marco Cosentino Lagomarsino
- Sorbonne Universités, UPMC Univ Paris 06, UMR 7238, Computational and Quantitative Biology, 15 rue de l'École de Médecine, Paris, France
- CNRS, UMR 7238, Paris, France
|
42
|
A triple helix model of medical innovation: Supply, demand, and technological capabilities in terms of Medical Subject Headings. RESEARCH POLICY 2016. [DOI: 10.1016/j.respol.2015.12.004] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.8] [Indexed: 11/20/2022]
|
43
|
Alanyali M, Preis T, Moat HS. Tracking Protests Using Geotagged Flickr Photographs. PLoS One 2016; 11:e0150466. [PMID: 26930654 PMCID: PMC4773018 DOI: 10.1371/journal.pone.0150466] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 02/27/2015] [Accepted: 02/15/2016] [Indexed: 11/18/2022] Open
Abstract
Recent years have witnessed waves of protests sweeping across countries and continents, in some cases resulting in political and governmental change. Much media attention has been focused on the increasing usage of social media to coordinate and provide instantly available reports on these protests. Here, we investigate whether it is possible to identify protest outbreaks through quantitative analysis of activity on the photo sharing site Flickr. We analyse 25 million photos uploaded to Flickr in 2013 across 244 countries and regions, and determine for each week in each country and region what proportion of the photographs are tagged with the word "protest" in 34 different languages. We find that higher proportions of "protest"-tagged photographs in a given country and region in a given week correspond to greater numbers of reports of protests in that country and region and week in the newspaper The Guardian. Our findings underline the potential value of photographs uploaded to the Internet as a source of global, cheap and rapidly available measurements of human behaviour in the real world.
Affiliation(s)
- Merve Alanyali
- Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry, CV4 7AL, United Kingdom
| | - Tobias Preis
- Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry, CV4 7AL, United Kingdom
| | - Helen Susannah Moat
- Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Coventry, CV4 7AL, United Kingdom
|
44
|
Letchford A, Preis T, Moat HS. Quantifying the Search Behaviour of Different Demographics Using Google Correlate. PLoS One 2016; 11:e0149025. [PMID: 26910464 PMCID: PMC4766235 DOI: 10.1371/journal.pone.0149025] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/04/2015] [Accepted: 01/26/2016] [Indexed: 11/18/2022] Open
Abstract
Vast records of our everyday interests and concerns are being generated by our frequent interactions with the Internet. Here, we investigate how the searches of Google users vary across U.S. states with different birth rates and infant mortality rates. We find that users in states with higher birth rates search for more information about pregnancy, while those in states with lower birth rates search for more information about cats. Similarly, we find that users in states with higher infant mortality rates search for more information about credit, loans and diseases. Our results provide evidence that Internet search data could offer new insight into the concerns of different demographics.
Affiliation(s)
- Adrian Letchford
- Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, CV4 7AL, Coventry, United Kingdom
| | - Tobias Preis
- Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, CV4 7AL, Coventry, United Kingdom
| | - Helen Susannah Moat
- Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, CV4 7AL, Coventry, United Kingdom
|
45
|
|
46
|
Abstract
Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf’s law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
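The quantity at the heart of the best-fitting version can be sketched as follows: the empirical complementary cumulative distribution function (CCDF) of word frequencies, i.e. the fraction of distinct words that occur at least n times. The second helper gives a rough exponent estimate using a widely used continuous approximation to the discrete maximum-likelihood estimator. It is an assumption standing in for the paper's tailored fitting and goodness-of-fit procedure, and it is known to be inaccurate when the lower cutoff `n_min` is small.

```python
import math
from collections import Counter

def frequency_ccdf(tokens):
    """Empirical CCDF over word types: S(n) = fraction with frequency >= n."""
    freqs = sorted(Counter(tokens).values())
    total = len(freqs)
    ccdf = {}
    for i, n in enumerate(freqs):
        # The first position where n appears counts every type with freq >= n.
        ccdf.setdefault(n, (total - i) / total)
    return ccdf

def approx_density_exponent(freqs, n_min=1):
    """Approximate MLE of the exponent of the frequency distribution.

    Uses alpha ~ 1 + m / sum(ln(n_i / (n_min - 0.5))) over the m values
    with n_i >= n_min; a rough estimate only, not a goodness-of-fit test.
    """
    tail = [n for n in freqs if n >= n_min]
    return 1 + len(tail) / sum(math.log(n / (n_min - 0.5)) for n in tail)
```

If the frequency distribution decays as n^(-alpha), the CCDF decays approximately as n^(-(alpha - 1)), so a straight line on a log-log plot of `frequency_ccdf` is the visual signature of the pure power-law form.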
|
47
|
Culturomics as a data playground for tests of selection: Mathematical approaches to detecting selection in word use. J Theor Biol 2016; 405:140-9. [PMID: 26802483 DOI: 10.1016/j.jtbi.2015.12.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 07/07/2015] [Revised: 12/01/2015] [Accepted: 12/28/2015] [Indexed: 11/23/2022]
Abstract
In biological evolution, traits may rise and fall in frequency due to genetic drift, where variant frequencies change by chance, or by selection, where advantageous variants will rise in frequency. The neutral model of evolution, first developed by Kimura in the 1960s, has become the standard against which selection is detected. While the balance between these two important forces, drift and selection, has been well established in biology, there are other domains where the contribution of these processes is still coming together. Although the idea of natural selection has been applied to the cultural domain since the time of Darwin, it has proven more challenging to positively identify cultural traits under selection, both because of a lack of established tests for selection and a lack of large cultural data sets. However, in recent years, with the accumulation of large cultural data sets, many cultural features from prehistoric pottery to modern baby names have been shown to evolve according to the neutral theory. But there is accumulating empirical evidence from cultural processes suggesting that the neutral theory alone cannot account for all features of the data. As such, there has been a renewed interest in determining whether there is selection amidst drift. Here we analyze a subset of English word frequencies, and determine whether frequency change reveals processes of selection. Inspired by the Moran and Wright-Fisher models in population genetics, we developed a neutral model of word frequency variation to assess when linguistic data appears to depart from neutral evolution. As such, our model represents a possible "test for selection" in the linguistic domain. We explore how the distribution of word use has changed for sets of words in English for more than 100 years (1901-2008) as expressed in vocabulary usage in published books, made available by Google Ngram. When comparing empirical word frequency changes to our neutral model we find pervasive and systematic departures from neutrality.
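A neutral baseline of the kind this abstract builds on can be sketched with a plain Wright-Fisher resampling step: each new "generation" of the corpus draws the same number of tokens, with each word's chance of being drawn equal to its current relative frequency. This is the textbook Wright-Fisher scheme without mutation or corpus growth, offered as an assumption-laden illustration and not as the authors' model. The function names are hypothetical.

```python
import random

def wright_fisher_step(counts, rng):
    """One neutral generation: resample all tokens by current frequency."""
    words = list(counts)
    n = sum(counts.values())
    sample = rng.choices(words, weights=[counts[w] for w in words], k=n)
    new_counts = {w: 0 for w in words}
    for w in sample:
        new_counts[w] += 1
    return new_counts

def simulate_neutral(counts, generations, seed=0):
    """Return the trajectory of word counts under pure drift."""
    rng = random.Random(seed)
    trajectory = [dict(counts)]
    for _ in range(generations):
        trajectory.append(wright_fisher_step(trajectory[-1], rng))
    return trajectory
```

Repeating such simulations yields the spread of frequency changes expected from drift alone. Observed changes falling well outside that spread are candidates for selection.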
|
48
|
|
49
|
Kitsak M, Elmokashfi A, Havlin S, Krioukov D. Long-Range Correlations and Memory in the Dynamics of Internet Interdomain Routing. PLoS One 2015; 10:e0141481. [PMID: 26529312 PMCID: PMC4631327 DOI: 10.1371/journal.pone.0141481] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/29/2015] [Accepted: 10/08/2015] [Indexed: 12/03/2022] Open
Abstract
Data transfer is one of the main functions of the Internet. The Internet consists of a large number of interconnected subnetworks or domains, known as Autonomous Systems (ASes). For privacy and other reasons, the information about which route to use to reach devices within other ASes is not readily available to any given AS. The Border Gateway Protocol (BGP) is responsible for discovering and distributing this reachability information to all ASes. Since the topology of the Internet is highly dynamic, all ASes constantly exchange and update this reachability information in small chunks, known as routing control packets or BGP updates. In view of the rapid growth of the Internet, there are significant concerns about the scalability of BGP updates and the efficiency of BGP routing in general. Motivated by these issues, we conduct a systematic time series analysis of BGP update rates. We find that BGP update time series are extremely volatile and exhibit long-term correlations and memory effects, similar to seismic time series, or temperature and stock market price fluctuations. The presented statistical characterization of BGP update dynamics could serve as a basis for validating existing models of Internet interdomain routing and developing better ones.
Affiliation(s)
- Maksim Kitsak
- Department of Physics, Northeastern University, Boston, MA, United States of America
| | | | - Shlomo Havlin
- Department of Physics, Bar-Ilan University, Ramat Gan, Israel
| | - Dmitri Krioukov
- Department of Physics, Northeastern University, Boston, MA, United States of America
- Department of Mathematics, Northeastern University, Boston, MA, United States of America
- Department of Electrical & Computer Engineering, Northeastern University, Boston, MA, United States of America
|
50
|
Bochkarev V, Solovyev V, Wichmann S. Universals versus historical contingencies in lexical evolution. J R Soc Interface 2015; 11:20140841. [PMID: 25274040 DOI: 10.1098/rsif.2014.0841] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022] Open
Abstract
The frequency with which we use different words changes all the time, and every so often, a new lexical item is invented or another one ceases to be used. Beyond a small sample of lexical items whose properties are well studied, little is known about the dynamics of lexical evolution. How do the lexical inventories of languages, viewed as entire systems, evolve? Is the rate of evolution of the lexicon contingent upon historical factors or is it driven by regularities, perhaps to do with universals of cognition and social interaction? We address these questions using the Google Books N-Gram Corpus as a source of data and relative entropy as a measure of changes in the frequency distributions of words. It turns out that there are both universals and historical contingencies at work. Across several languages, we observe similar rates of change, but only at timescales of at least around five decades. At shorter timescales, the rate of change is highly variable and differs between languages. Major societal transformations as well as catastrophic events such as wars lead to increased change in frequency distributions, whereas stability in society has a dampening effect on lexical evolution.
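Relative entropy (Kullback-Leibler divergence), the measure of change named in this abstract, can be computed between the word frequency distributions of two periods as sketched below. The additive smoothing constant is an assumption introduced so that words missing from one period do not produce a division by zero. The abstract does not say how the authors handled such words.

```python
import math

def relative_entropy(p_counts, q_counts, smoothing=1e-9):
    """KL divergence D(P || Q) in bits between two word-count dictionaries.

    A small additive constant (an assumption) keeps every probability
    positive over the combined vocabulary of both periods.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + smoothing) / p_total
        q = (q_counts.get(w, 0) + smoothing) / q_total
        divergence += p * math.log2(p / q)
    return divergence
```

Computing this for pairs of years separated by a fixed lag gives a rate-of-change series. Note that the measure is asymmetric, so D(P || Q) and D(Q || P) generally differ.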
Affiliation(s)
- V Bochkarev
- Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia
| | - V Solovyev
- Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia
| | - S Wichmann
- Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia
- Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
|