1
Menéndez‐Serra M, Ontiveros VJ, Barberán A, Casamayor EO. Absence of stress‐promoted facilitation coupled with a competition decrease in the microbiome of ephemeral saline lakes. Ecology 2022; 103:e3834. [PMID: 35872610 PMCID: PMC10078231 DOI: 10.1002/ecy.3834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/30/2021] [Revised: 05/03/2022] [Accepted: 06/23/2022] [Indexed: 11/06/2022]
Abstract
Salinity fluctuations are a well-known, severe stress factor that strongly shapes global biological distributions and abundances. However, there is a knowledge gap regarding how increasing saline stress affects microbial biological interactions. We combined a probabilistic method for estimating significant co-occurrences/exclusions with a conceptual framework for filtering out associations potentially linked to environmental and/or spatial factors, and applied them to a series of connected ephemeral (hyper)saline lakes. We carried out a network analysis of the full aquatic microbiome (bacteria, eukarya, and archaea) under severe salinity fluctuations. Most of the observed co-occurrences/exclusions were potentially explained by environmental niche and/or dispersal limitation. Co-occurrences assigned to potential biological interactions remained stable, suggesting that the salt gradient was not promoting interspecific facilitation processes. Conversely, co-exclusions assigned to potential biological interactions decreased along the gradient in both number and network complexity, pointing to a decrease in interspecies competition as salinity increased. Overall, higher saline stress reduced microbial co-exclusions while co-occurrences remained stable, suggesting decreasing competition coupled with a lack of stress-gradient-promoted facilitation in the microbiome of ephemeral saline lakes.
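The probabilistic co-occurrence step mentioned in this abstract can be illustrated with a pairwise hypergeometric test. This is only a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the toy presence data, and the 0.05 threshold are hypothetical, and the filtering of pairs explained by environmental or spatial factors is not shown.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalues(n_sites, n_a, n_b, n_both):
    """Return (p_lt, p_gt) for two taxa placed at random among n_sites.

    p_lt: probability of n_both or fewer shared sites (small => co-exclusion).
    p_gt: probability of n_both or more shared sites (small => co-occurrence).
    """
    denom = comb(n_sites, n_b)

    def pmf(k):
        # Hypergeometric probability of exactly k shared sites.
        return comb(n_a, k) * comb(n_sites - n_a, n_b - k) / denom

    lo = max(0, n_a + n_b - n_sites)
    hi = min(n_a, n_b)
    p_lt = sum(pmf(k) for k in range(lo, n_both + 1))
    p_gt = sum(pmf(k) for k in range(n_both, hi + 1))
    return p_lt, p_gt


def classify_pairs(presence, n_sites, alpha=0.05):
    """presence maps each taxon to the set of site ids where it was found."""
    labels = {}
    for a, b in combinations(sorted(presence), 2):
        shared = len(presence[a] & presence[b])
        p_lt, p_gt = cooccurrence_pvalues(
            n_sites, len(presence[a]), len(presence[b]), shared
        )
        if p_gt < alpha:
            labels[(a, b)] = "co-occurrence"
        elif p_lt < alpha:
            labels[(a, b)] = "co-exclusion"
        else:
            labels[(a, b)] = "random"
    return labels


# Hypothetical toy data: four taxa observed across ten sites.
presence = {
    "A": {0, 1, 2, 3, 4},
    "B": {0, 1, 2, 3, 4},
    "C": {5, 6, 7, 8, 9},
    "D": {0, 1, 5, 6, 7},
}
labels = classify_pairs(presence, n_sites=10)
```

In this toy case, A and B share all five of their sites and are labelled a co-occurrence, A and C never meet and are labelled a co-exclusion, and D overlaps with the others about as often as chance would predict.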
Affiliation(s)
- Vicente J. Ontiveros
- Theoretical and Computational Ecology Group, Centre of Advanced Studies of Blanes (CEAB), Spanish Research Council (CSIC) Blanes Catalonia Spain
- Albert Barberán
- Department of Environmental Science University of Arizona Tucson AZ USA
2
Abstract
A fundamental question in biology is why some species tend to occur together in the same locations, while others are never observed coexisting. This question becomes particularly relevant for microorganisms thriving in the highly diluted waters of high mountain lakes, where biotic interactions might be required to make the most of an extreme environment. We studied a high-throughput gene data set of alpine lakes (>220 Pyrenean lakes) with cooccurrence network analysis to infer potential biotic interactions, using the combination of a probabilistic method for determining significant cooccurrences and coexclusions between pairs of species and a conceptual framework for classifying the nature of the observed cooccurrences and coexclusions. This computational approach (i) determined and quantified the importance of environmental variables and spatial distribution and (ii) defined potential interacting microbial assemblages. We determined the properties and relationships between these assemblages by examining node properties at the taxonomic level, indicating associations with their potential habitat sources (i.e., aquatic versus terrestrial) and their functional strategies (i.e., parasitic versus mixotrophic). Environmental variables explained fewer pairs in bacteria than in microbial eukaryotes for the alpine data set, with pH alone explaining the highest proportion of bacterial pairs. Nutrient composition was also relevant for explaining association pairs, particularly in microeukaryotes. We identified a reduced subset of pairs with the highest probability of species interactions (“interacting guilds”) that significantly reached higher occupancies and lower mean relative abundances in agreement with the carrying capacity hypothesis. 
The interacting bacterial guilds could be more related to habitat and microdispersal processes (i.e., aquatic versus soil microbes), whereas for microeukaryotes trophic roles (osmotrophs, mixotrophs, and parasites) could potentially play a major role. Overall, our approach may add helpful information to guide further efforts toward a mechanistic understanding of microbial interactions in situ. IMPORTANCE A fundamental question in biology is why some species tend to occur together in the same locations, while others are never observed to coexist. This question becomes particularly relevant for microorganisms thriving in the highly diluted waters of high mountain lakes, in which biotic interactions might be required to make the most of an extreme environment. Microbial metacommunities are too often studied only in terms of their environmental niches and geographic barriers, since biological interactions and their role as drivers of ecosystem functioning are inherently difficult to quantify. Our study highlights that telling apart potential interactions from both environmental and geographic niches may help in the initial characterization of organisms with similar ecologies across a wide range of ecosystems, even when information about actual interactions is partial and limited. The multilayered statistical approach carried out here offers the possibility of going beyond taxonomy to understand microbiological behavior in situ.
3
Duque A, Fabregat H, Araujo L, Martinez-Romo J. A keyphrase-based approach for interpretable ICD-10 code classification of Spanish medical reports. Artif Intell Med 2021; 121:102177. [PMID: 34763812 DOI: 10.1016/j.artmed.2021.102177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/21/2020] [Revised: 09/14/2021] [Accepted: 09/14/2021] [Indexed: 11/25/2022]
Abstract
BACKGROUND AND OBJECTIVES The 10th revision of the International Classification of Diseases (ICD-10) coding system has been widely adopted by the health systems of many countries, including Spain. However, manual code assignment for Electronic Health Records (EHRs) is a complex and time-consuming task that requires a large number of specialised staff. Therefore, several machine learning approaches are being proposed to assist in the assignment task. In this work we present an alternative system for automatically recommending ICD-10 codes to be assigned to EHRs. METHODS Our proposal is based on characterising each ICD-10 code by a set of keyphrases that represent it. These keyphrases include not only those that appear literally in some EHR to which the code was assigned, but also others obtained by a statistical process able to capture the expressions that led the annotators to assign the code. RESULTS The result is an information model that allows codes to be recommended efficiently for a new EHR based on its textual content. We explore an approach that proves to be competitive with other state-of-the-art approaches and can be combined with them to optimise results. CONCLUSIONS In addition to being effective, the recommendations of this method are easily interpretable, since the phrases in an EHR that lead to each recommended ICD-10 code are known. Moreover, the keyphrases associated with each ICD-10 code can be a valuable additional source of information for other approaches, such as machine learning techniques.
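The idea of recommending codes from matched keyphrases, with the matched phrases kept as the explanation, can be sketched as follows. This is an illustrative toy, not the system described in the abstract: the keyphrase table is hand-written and hypothetical (and in English rather than Spanish), the scoring is a plain count of matched phrases, and the statistical process that derives keyphrases is not modelled.

```python
import re
from collections import defaultdict

# Hypothetical keyphrase model: ICD-10 code -> phrases assumed to signal it.
CODE_KEYPHRASES = {
    "J18.9": ["pneumonia", "lung consolidation"],
    "E11.9": ["type 2 diabetes", "diabetes mellitus"],
    "I10": ["hypertension", "high blood pressure"],
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def recommend_codes(report, code_keyphrases, top_k=3):
    """Return up to top_k (code, matched_phrases) pairs, best match first."""
    text = f" {normalize(report)} "
    evidence = defaultdict(list)
    for code, phrases in code_keyphrases.items():
        for phrase in phrases:
            # Pad with spaces so that only whole-word matches count.
            if f" {normalize(phrase)} " in text:
                evidence[code].append(phrase)
    # Rank by number of matched phrases; break ties by code for determinism.
    ranked = sorted(evidence.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return ranked[:top_k]


report = (
    "Patient with high blood pressure and hypertension; "
    "chest X-ray shows pneumonia."
)
recommendations = recommend_codes(report, CODE_KEYPHRASES)
```

Each recommendation carries the phrases that triggered it, which is the interpretability property the abstract emphasises.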
Affiliation(s)
- Andres Duque
- Universidad Nacional de Educación a Distancia (UNED). ETS Ingeniería Informática, Juan del Rosal 16, 28040 Madrid, Spain; Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS), Spain.
- Hermenegildo Fabregat
- Universidad Nacional de Educación a Distancia (UNED). ETS Ingeniería Informática, Juan del Rosal 16, 28040 Madrid, Spain.
- Lourdes Araujo
- Universidad Nacional de Educación a Distancia (UNED). ETS Ingeniería Informática, Juan del Rosal 16, 28040 Madrid, Spain; Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS), Spain.
- Juan Martinez-Romo
- Universidad Nacional de Educación a Distancia (UNED). ETS Ingeniería Informática, Juan del Rosal 16, 28040 Madrid, Spain; Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS), Spain.
4
Tamarit I, Pereda M, Cuesta JA. Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences. Phys Rev E 2020; 102:042304. [PMID: 33212688 DOI: 10.1103/physreve.102.042304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2020] [Accepted: 09/13/2020] [Indexed: 11/07/2022]
Abstract
When some 'entities' are related by the 'features' they share, they are amenable to a bipartite network representation. Plant-pollinator ecological communities, co-authorship of scientific papers, customers and purchases, or answers in a poll are but a few examples. Analyzing the clustering of such entities in the network is a useful tool with applications in many fields, such as internet technology, recommender systems, or detection of diseases. The algorithms most widely applied to find clusters in bipartite networks are variants of modularity optimization. Here, we provide a hierarchical clustering algorithm based on a dissimilarity between entities that quantifies the probability that the features shared by two entities are due to mere chance. The algorithm's computational cost is O(n^2) when applied to a set of n entities, and its outcome is a dendrogram exhibiting the connections of those entities. Through the introduction of a 'susceptibility' measure we can provide an 'optimal' choice for the clustering as well as quantify its quality. The dendrogram reveals further useful structural information, however, such as the existence of subclusters within clusters or of nodes that do not fit in any cluster. We illustrate the algorithm by applying it first to a set of synthetic networks, and then to a selection of examples. We also illustrate how to transform our algorithm into a valid alternative for one-mode networks as well, and show that it performs at least as well as the standard, modularity-based algorithms, with higher numerical performance. We provide an implementation of the algorithm in Python, freely accessible from GitHub.
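A rough way to picture clustering by a chance-based dissimilarity is sketched below. Everything here is an assumption made for illustration: the dissimilarity is taken to be a hypergeometric tail probability, the linkage is average linkage, and the merge loop is a naive one that does not achieve the O(n^2) cost stated in the abstract. It is not the authors' published implementation.

```python
from itertools import combinations
from math import comb


def chance_overlap_probability(n_features, size_a, size_b, shared):
    """P(overlap >= shared) if size_b features were drawn uniformly at random.

    Small values mean the observed overlap is unlikely to be due to chance,
    so the two entities are treated as close.
    """
    hi = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(n_features - size_a, size_b - k)
        for k in range(shared, hi + 1)
    )
    return favourable / comb(n_features, size_b)


def cluster(entities, n_features):
    """Naive average-linkage agglomeration; returns the list of merges.

    entities maps a name to its set of features. Each merge is a tuple
    (cluster_a, cluster_b, dissimilarity), in the order it was performed.
    """
    names = sorted(entities)
    dist = {
        frozenset((a, b)): chance_overlap_probability(
            n_features,
            len(entities[a]),
            len(entities[b]),
            len(entities[a] & entities[b]),
        )
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters; ties go to the first pair found.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical toy data: four entities described by features 0..11.
entities = {
    "a": {0, 1, 2, 3},
    "b": {0, 1, 2, 4},
    "c": {8, 9, 10, 11},
    "d": {7, 8, 9, 10},
}
merges = cluster(entities, n_features=12)
```

The merge list plays the role of a dendrogram: the two tight pairs merge first at a low dissimilarity, and the final merge between the two groups happens at a dissimilarity of 1.0 because they share no features.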
Affiliation(s)
- Ignacio Tamarit
- Grupo Interdisciplinar de Sistemas Complejos (GISC), Departamento de Matemáticas de la Universidad Carlos III de Madrid, Leganés, Spain; Unidad Mixta Interdisciplinar de Comportamiento y Complejidad Social (UMICCS), Madrid, Spain
- María Pereda
- Grupo Interdisciplinar de Sistemas Complejos (GISC), Departamento de Matemáticas de la Universidad Carlos III de Madrid, Leganés, Spain; Unidad Mixta Interdisciplinar de Comportamiento y Complejidad Social (UMICCS), Madrid, Spain; Grupo de Investigación Ingeniería de Organización y Logística (IOL), Escuela Técnica Superior de Ingenieros Industriales, Universidad Politécnica de Madrid, Madrid, Spain
- José A Cuesta
- Grupo Interdisciplinar de Sistemas Complejos (GISC), Departamento de Matemáticas de la Universidad Carlos III de Madrid, Leganés, Spain; Unidad Mixta Interdisciplinar de Comportamiento y Complejidad Social (UMICCS), Madrid, Spain; Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), Universidad de Zaragoza, Zaragoza, Spain; UC3M-Santander Big Data Institute (IBiDat), Getafe, Spain
5
Rodriguez-Prieto O, Araujo L, Martinez-Romo J. Discovering related scientific literature beyond semantic similarity: a new co-citation approach. Scientometrics 2019. [DOI: 10.1007/s11192-019-03125-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/27/2022]
6
Bovet A, Morone F, Makse HA. Validation of Twitter opinion trends with national polling aggregates: Hillary Clinton vs Donald Trump. Sci Rep 2018; 8:8673. [PMID: 29875364 PMCID: PMC5989214 DOI: 10.1038/s41598-018-26951-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 12/08/2017] [Accepted: 05/09/2018] [Indexed: 11/09/2022] Open
Abstract
Measuring and forecasting opinion trends from real-time social media is a long-standing goal of big-data analytics. Despite the large amount of work addressing this question, there has been no clear validation of online social media opinion trends against traditional surveys. Here we develop a method to infer the opinion of Twitter users by using a combination of statistical physics of complex networks and machine learning based on hashtag co-occurrence to build an in-domain training set on the order of a million tweets. We validate our method in the context of the 2016 US Presidential Election by comparing the Twitter opinion trend with the New York Times National Polling Average, representing an aggregate of hundreds of independent traditional polls. The Twitter opinion trend follows the aggregated NYT polls with remarkable accuracy. We investigate the dynamics of the social network formed by the interactions among millions of Twitter supporters and infer the support of each user for the presidential candidates. Our analytics unleash the power of Twitter to uncover social trends, from elections and brands to political movements, at a fraction of the cost of traditional surveys.
Affiliation(s)
- Alexandre Bovet
- Levich Institute and Physics Department, City College of New York, New York, New York, 10031, USA
- Flaviano Morone
- Levich Institute and Physics Department, City College of New York, New York, New York, 10031, USA
- Hernán A Makse
- Levich Institute and Physics Department, City College of New York, New York, New York, 10031, USA.
7
Duque A, Stevenson M, Martinez-Romo J, Araujo L. Co-occurrence graphs for word sense disambiguation in the biomedical domain. Artif Intell Med 2018; 87:9-19. [DOI: 10.1016/j.artmed.2018.03.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 07/17/2017] [Revised: 01/23/2018] [Accepted: 03/11/2018] [Indexed: 10/17/2022]
8
Akimushkin C, Amancio DR, Oliveira ON. Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks. PLoS One 2017; 12:e0170527. [PMID: 28125703 PMCID: PMC5268788 DOI: 10.1371/journal.pone.0170527] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Received: 11/07/2016] [Accepted: 12/24/2016] [Indexed: 11/18/2022] Open
Abstract
Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remaining were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. a remarkable 88.75% author matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, thus paving the way for a robust description of large texts in terms of small evolving networks.
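The pipeline outlined in this abstract (split a text into equal-sized chunks, build a word co-occurrence network per chunk, track a topological metric as a series, then summarise the series by its moments) can be sketched in a few lines. This is a simplified illustration under assumptions of my own: words are linked when adjacent, only one metric (average degree) is tracked rather than twelve, and no ARIMA modelling, Isomap, or classifier is included.

```python
import re
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase word tokens; punctuation is discarded."""
    return re.findall(r"[a-z']+", text.lower())


def cooccurrence_network(tokens):
    """Link each pair of distinct adjacent words; returns word -> neighbours."""
    adjacency = {word: set() for word in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            adjacency[left].add(right)
            adjacency[right].add(left)
    return adjacency


def average_degree(adjacency):
    return sum(len(neigh) for neigh in adjacency.values()) / len(adjacency)


def metric_series(text, chunk_size):
    """Average degree of the network built from each full chunk of tokens."""
    tokens = tokenize(text)
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    return [
        average_degree(cooccurrence_network(tokens[s:s + chunk_size]))
        for s in starts
    ]


def series_moments(series):
    """First two moments of the series, usable as learning attributes."""
    return mean(series), pstdev(series)


# Hypothetical toy text with two chunks of six tokens each.
text = "The cat sat on the mat. Go go go now go now"
series = metric_series(text, chunk_size=6)
moments = series_moments(series)
```

For the toy text the first chunk yields a five-word network with average degree 2.0 and the second a two-word network with average degree 1.0, so the series is [2.0, 1.0] with mean 1.5 and standard deviation 0.5.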
Affiliation(s)
- Camilo Akimushkin
- São Carlos Institute of Physics, University of São Paulo, São Carlos, São Paulo, Brazil
- Diego Raphael Amancio
- Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, São Paulo, Brazil
9
Duque A, Martinez-Romo J, Araujo L. Can multilinguality improve Biomedical Word Sense Disambiguation? J Biomed Inform 2016; 64:320-332. [PMID: 27815227 DOI: 10.1016/j.jbi.2016.10.020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/26/2016] [Revised: 10/24/2016] [Accepted: 10/31/2016] [Indexed: 10/20/2022]
Abstract
Ambiguity in the biomedical domain represents a major issue when performing Natural Language Processing tasks over the huge amount of available information in the field. For this reason, Word Sense Disambiguation is critical for achieving accurate systems able to tackle complex tasks such as information extraction, summarization or document classification. In this work we explore whether multilinguality can help to solve the problem of ambiguity, and the conditions required for a system to improve the results obtained by monolingual approaches. Also, we analyze the best ways to generate those useful multilingual resources, and study different languages and sources of knowledge. The proposed system, based on co-occurrence graphs containing biomedical concepts and textual information, is evaluated on a test dataset frequently used in biomedicine. We can conclude that multilingual resources are able to provide a clear improvement of more than 7% compared to monolingual approaches, for graphs built from a small number of documents. Also, empirical results show that automatically translated resources are a useful source of information for this particular task.
Affiliation(s)
- Andres Duque
- NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain.
- Juan Martinez-Romo
- NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain.
- Lourdes Araujo
- NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain.
10
de Arruda HF, Costa LDF, Amancio DR. Topic segmentation via community detection in complex networks. Chaos (Woodbury, N.Y.) 2016; 26:063120. [PMID: 27368785 DOI: 10.1063/1.4954215] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 06/06/2023]
Abstract
Many real systems have been modeled in terms of network concepts, and written texts are a particular example of information networks. In recent years, the use of network methods to analyze language has allowed the discovery of several interesting effects, including the proposition of novel models to explain the emergence of fundamental universal patterns. While syntactical networks, one of the most prevalent networked models of written texts, display both scale-free and small-world properties, such a representation fails in capturing other textual features, such as the organization in topics or subjects. We propose a novel network representation whose main purpose is to capture the semantical relationships of words in a simple way. To do so, we link all words co-occurring in the same semantic context, which is defined in a threefold way. We show that the proposed representations favor the emergence of communities of semantically related words, and this feature may be used to identify relevant topics. The proposed methodology to detect topics was applied to segment selected Wikipedia articles. We found that, in general, our methods outperform traditional bag-of-words representations, which suggests that a high-level textual representation may be useful to study the semantical features of texts.
Affiliation(s)
- Henrique F de Arruda
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, São Paulo, Brazil
- Luciano da F Costa
- São Carlos Institute of Physics, University of São Paulo, São Carlos, São Paulo, Brazil
- Diego R Amancio
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, São Paulo, Brazil
11
Abstract
Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems have proved useful for creating several language models. Despite the large number of studies devoted to representing texts with physical models, only a limited number have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex network methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, hybrid approaches outperformed both the traditional and the networked methods used on their own. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.
Affiliation(s)
- Diego Raphael Amancio
- Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos, São Paulo, Brazil
12
Duque A, Martinez-Romo J, Araujo L. Choosing the best dictionary for Cross-Lingual Word Sense Disambiguation. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.02.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 10/24/2022]
13
Martinez-Romo J, Araujo L, Duque Fernandez A. SemGraph: Extracting keyphrases following a novel semantic graph-based approach. J Assoc Inf Sci Technol 2015. [DOI: 10.1002/asi.23365] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 11/11/2022]
Affiliation(s)
- Juan Martinez-Romo
- NLP & IR Group; Dpto. Lenguajes y Sistemas Informáticos; Universidad Nacional de Educación a Distancia (UNED); Juan del Rosal 16. 28040 Madrid Spain
- Lourdes Araujo
- NLP & IR Group; Dpto. Lenguajes y Sistemas Informáticos; Universidad Nacional de Educación a Distancia (UNED); Juan del Rosal 16. 28040 Madrid Spain
- Andres Duque Fernandez
- NLP & IR Group; Dpto. Lenguajes y Sistemas Informáticos; Universidad Nacional de Educación a Distancia (UNED); Juan del Rosal 16. 28040 Madrid Spain
14
Capitán JA, Borge-Holthoefer J, Gómez S, Martinez-Romo J, Araujo L, Cuesta JA, Arenas A. Local-based semantic navigation on a networked representation of information. PLoS One 2012; 7:e43694. [PMID: 22937081 PMCID: PMC3427177 DOI: 10.1371/journal.pone.0043694] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 06/11/2012] [Accepted: 07/23/2012] [Indexed: 11/18/2022] Open
Abstract
The size and complexity of actual networked systems hinder access to global knowledge of their structure. This fact pushes the problem of navigation to suboptimal solutions, one of them being the extraction of a coherent map of the topology on which navigation takes place. In this paper, we present a Markov chain based algorithm to tag networked terms according only to their topological features. The resulting tagging is used to compute similarity between terms, providing a map of the networked information. This map supports local-based navigation techniques driven by similarity. We evaluate the efficiency of the resulting paths by comparing their length with that of the shortest path. Additionally, we claim that the path steps towards the destination are semantically coherent. To illustrate the algorithm's performance we provide some results from the Simple English Wikipedia, which amounts to several thousand pages. The simplest greedy strategy yields an average success rate of over 80%. Furthermore, the resulting content-coherent paths most often have a cost between one- and threefold that of the shortest-path length.
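Similarity-driven greedy navigation, and its comparison with the shortest path, can be sketched as follows. This is a toy under stated assumptions: the tags are written by hand rather than produced by the Markov chain based tagging the abstract describes, Jaccard overlap stands in for the similarity measure, and the graph, node names, and function names are hypothetical.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_navigate(graph, tags, source, target, max_steps=100):
    """Move to the unvisited neighbour whose tags best match the target's.

    Uses only local information (the current node's neighbours). Returns the
    path as a list of nodes, or None if the walk gets stuck or runs too long.
    """
    path = [source]
    visited = {source}
    while path[-1] != target:
        if len(path) > max_steps:
            return None
        current = path[-1]
        if target in graph[current]:
            step = target
        else:
            candidates = [n for n in graph[current] if n not in visited]
            if not candidates:
                return None
            step = max(candidates, key=lambda n: jaccard(tags[n], tags[target]))
        path.append(step)
        visited.add(step)
    return path


# Hypothetical toy network and tags.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "F"],
    "D": ["B", "F"],
    "F": ["C", "D"],
}
tags = {
    "A": {"p"},
    "B": {"x", "y"},
    "C": {"q"},
    "D": {"x", "y", "w"},
    "F": {"x", "y", "z"},
}
path = greedy_navigate(graph, tags, "A", "F")
optimal = shortest_path_length(graph, "A", "F")
cost_ratio = (len(path) - 1) / optimal
```

In this example the greedy walk follows the nodes whose tags resemble the target's (A, B, D, F) and takes three hops, while the shortest route through C takes two, giving a cost ratio of 1.5.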
Affiliation(s)
- José A Capitán
- Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Tarragona, Spain.