kth.se Publications
Cuba Gyllensten, Amaru (ORCID iD: orcid.org/0000-0002-2236-4978)
Publications (8 of 8)
Cuba Gyllensten, A. (2023). Quantifying Meaning. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
Quantifying Meaning
2023 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Distributional semantic models are a class of machine learning models with the aim of constructing representations that capture the semantics, i.e. meaning, of objects that carry meaning in a data-driven fashion. This thesis is particularly concerned with the construction of semantic representations of words, an endeavour that has a long history in computational linguistics, and that has seen dramatic developments in recent years.

The primary research objective of this thesis is to explore the limits and applications of distributional semantic models of words, i.e. word embeddings. In particular, it explores the relation between model and embedding semantics, i.e. how model design influences what our embeddings encode, how to reason about embeddings, and how properties of the model can be exploited to extract novel information from embeddings. Concretely, we introduce topologically aware neighborhood queries that enrich the information gained from neighborhood queries on distributional semantic models, conditioned similarity queries (and models enabling them), concept extraction from distributional semantic models, applications of embedding models in the realm of political science, as well as a thorough evaluation of a broad range of distributional semantic models. 
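
The most basic operation the thesis builds on, a nearest-neighbor query over word embeddings, can be sketched as follows. This is a toy illustration with made-up vectors and plain cosine similarity, not the thesis's models or data:

```python
import numpy as np

def nearest_neighbors(embeddings, vocab, query, k=2):
    """Return the k words whose vectors have the highest cosine
    similarity to the query word's vector (excluding the query itself)."""
    q = embeddings[vocab.index(query)]
    # Normalize rows so dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ (q / np.linalg.norm(q))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != query][:k]

# Toy 2-d embedding space.
vocab = ["cat", "dog", "car", "truck"]
E = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
print(nearest_neighbors(E, vocab, "cat", k=2))
```

The topologically aware and conditioned queries introduced in the thesis refine this ranked-list view; the sketch only shows the baseline query they enrich.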

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2023. p. 45
Series
TRITA-EECS-AVL ; 2023:2
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-322262 (URN), 978-91-8040-444-0 (ISBN)
Public defence
2023-01-17, Zoom: https://kth-se.zoom.us/j/66943302856, F3, Lindstedtsvägen 26, Stockholm, 09:00 (English)
Opponent
Supervisors
Note

QC 20221207

Available from: 2022-12-08 Created: 2022-12-07 Last updated: 2025-02-07. Bibliographically approved
Öhman, J., Verlinden, S., Ekgren, A., Cuba Gyllensten, A., Isbister, T., Gogoulou, E., . . . Sahlgren, M. (2023). The Nordic Pile: A 1.2 TB Nordic Dataset for Language Modeling.
The Nordic Pile: A 1.2 TB Nordic Dataset for Language Modeling
2023 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. In order to facilitate the development of LLMs in the Nordic languages, we curate a high-quality dataset consisting of 1.2 TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.
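
The cleaning and filtering the abstract mentions can be illustrated with a minimal sketch. The specific heuristics here (a length threshold, an alphabetic-character ratio, exact-hash deduplication) are generic assumptions for illustration, not the paper's actual pipeline:

```python
import hashlib

def quality_filter(doc, min_chars=200, min_alpha_ratio=0.7):
    """Keep a document only if it is long enough and mostly alphabetic text."""
    if len(doc) < min_chars:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in doc)
    return alpha / len(doc) >= min_alpha_ratio

def deduplicate(docs):
    """Drop exact duplicates using a content hash."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out
```

Real corpus curation at this scale also involves language identification and near-duplicate detection, which the paper details and this sketch omits.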

National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-354951 (URN), 10.48550/arXiv.2303.17183 (DOI)
Note

QC 20241023

Available from: 2024-10-17 Created: 2024-10-17 Last updated: 2025-02-07. Bibliographically approved
Lenci, A., Sahlgren, M., Jeuniaux, P., Cuba Gyllensten, A. & Miliani, M. (2022). A comparative evaluation and analysis of three generations of Distributional Semantic Models. Language resources and evaluation, 56(4), 1269-1313
A comparative evaluation and analysis of three generations of Distributional Semantic Models
2022 (English) In: Language resources and evaluation, ISSN 1574-020X, E-ISSN 1574-0218, Vol. 56, no 4, p. 1269-1313. Article in journal (Refereed), Published
Abstract [en]

Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by neural language models. Although an extensive body of research has been devoted to Distributional Semantic Model (DSM) evaluation, we still lack a thorough comparison with respect to tested models, semantic tasks, and benchmark datasets. Moreover, previous work has mostly focused on task-driven evaluation, instead of exploring the differences between the way models represent the lexical semantic space. In this paper, we perform a large-scale evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT. First of all, we investigate the performance of embeddings in several semantic tasks, carrying out an in-depth statistical analysis to identify the major factors influencing the behavior of DSMs. The results show that (i) the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous and (ii) static DSMs surpass BERT representations in most out-of-context semantic tasks and datasets. Furthermore, we borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models. RSA reveals important differences related to the frequency and part-of-speech of lexical items.
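
The RSA methodology referred to above compares two representational spaces by correlating their pairwise similarity structures over the same word list. A minimal sketch, assuming cosine similarity matrices and a Spearman correlation over the upper triangles (common choices for RSA, not necessarily the paper's exact configuration):

```python
import numpy as np

def rank(x):
    """Rank-transform a vector (ties broken by sort order)."""
    order = np.argsort(x)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(x))
    return ranks

def rsa(space_a, space_b):
    """Spearman correlation between the pairwise cosine-similarity
    structures of two embedding spaces over the same word list."""
    def sim_matrix(E):
        U = E / np.linalg.norm(E, axis=1, keepdims=True)
        return U @ U.T
    iu = np.triu_indices(space_a.shape[0], k=1)  # upper triangle, no diagonal
    a = rank(sim_matrix(space_a)[iu]).astype(float)
    b = rank(sim_matrix(space_b)[iu]).astype(float)
    return np.corrcoef(a, b)[0, 1]

# Two toy spaces over the same four words; identical geometry gives RSA = 1.
A = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
B = 2.0 * A  # same directions, different scale
print(round(rsa(A, B), 3))
```

Because the comparison happens at the level of similarity structure rather than raw coordinates, RSA can compare spaces of different dimensionality, which is what makes it useful for contrasting static DSMs with BERT-derived vectors.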

National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322094 (URN), 10.1007/s10579-021-09575-z (DOI), 000762866400001 (), 2-s2.0-85125439429 (Scopus ID)
Note

QC 20221202

Available from: 2022-12-01 Created: 2022-12-01 Last updated: 2025-02-07. Bibliographically approved
Dahlberg, S., Axelsson, S., Gyllensten, A. C., Sahlgren, M., Ekgren, A., Holmberg, S. & Schwarz, J. A. (2022). A Distributional Semantic Online Lexicon for Linguistic Explorations of Societies. Social science computer review
A Distributional Semantic Online Lexicon for Linguistic Explorations of Societies
2022 (English) In: Social science computer review, ISSN 0894-4393, E-ISSN 1552-8286. Article in journal (Refereed), Published
Abstract [en]

Linguistic Explorations of Societies (LES) is an interdisciplinary research project with scholars from the fields of political science, computer science, and computational linguistics. The overarching ambition of LES has been to contribute to the survey-based comparative scholarship by compiling and analyzing online text data within and between languages and countries. To this end, the project has developed an online semantic lexicon, which allows researchers to explore meanings and usages of words in online media across a substantial number of geo-coded languages. The lexicon covers data from approximately 140 language-country combinations and is, to our knowledge, the most extensive free research resource of its kind. Such a resource makes it possible to critically examine survey translations and identify discrepancies in order to modify and improve existing survey methodology, and its unique features further enable Internet researchers to study public debate online from a comparative perspective. In this article, we discuss the social scientific rationale for using online text data as a complement to survey data, and present the natural language processing-based methodology behind the lexicon including its underpinning theory and practical modeling. Finally, we engage in a critical reflection about the challenges of using online text data to gauge public opinion and political behavior across the world.

Place, publisher, year, edition, pages
SAGE Publications, 2022
Keywords
distributional semantics, natural language processing, word2vec, comparative surveys, language use, semantic similarities, Language Technology (Computational Linguistics), Språkteknologi (språkvetenskaplig databehandling)
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322157 (URN), 10.1177/08944393211049774 (DOI), 000787865700001 (), 2-s2.0-85130070813 (Scopus ID)
Note

QC 20221202

Available from: 2022-12-02 Created: 2022-12-02 Last updated: 2025-02-07. Bibliographically approved
Cuba Gyllensten, A., Ekgren, A. & Sahlgren, M. (2019). R-grams: Unsupervised Learning of Semantic Units in Natural Language. In: Proceedings of the 13th International Conference on Computational Semantics - Student Papers. Paper presented at the 13th International Conference on Computational Semantics.
R-grams: Unsupervised Learning of Semantic Units in Natural Language
2019 (English) In: Proceedings of the 13th International Conference on Computational Semantics - Student Papers, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques. In contrast to previous work, which has primarily been focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.
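
The Re-Pair/BPE-style procedure underlying r-grams repeatedly merges the most frequent adjacent pair of units into a new unit. A minimal character-level sketch of that loop (illustrative only; the paper's implementation and corpus handling differ):

```python
from collections import Counter

def most_frequent_pair(seq):
    """Count adjacent pairs and return the most frequent one."""
    pairs = Counter(zip(seq, seq[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with a merged unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe(text, merges):
    """Apply `merges` rounds of most-frequent-pair merging to a string."""
    seq = list(text)
    for _ in range(merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair)
    return seq

print(bpe("abababcd", merges=2))
```

Run long enough on real text, the same loop produces units above the word level, which is the regime the paper studies.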

National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322093 (URN), 10.48550/arXiv.1808.04670 (DOI)
Conference
13th International Conference on Computational Semantics
Note

QC 20221202

Available from: 2022-12-01 Created: 2022-12-01 Last updated: 2025-02-07. Bibliographically approved
Gyllensten, A. C. & Sahlgren, M. (2018). Distributional term set expansion. Paper presented at LREC 2018 - 11th International Conference on Language Resources and Evaluation, 2018 (pp. 2554-2558).
Distributional term set expansion
2018 (English) Conference paper, Published paper (Other academic)
Abstract [en]

This paper is a short empirical study of the performance of centrality and classification based iterative term set expansion methods for distributional semantic models. Iterative term set expansion is an interactive process using distributional semantic models where a user labels terms as belonging to some sought-after term set, and a system uses this labeling to supply the user with new candidate terms to label, trying to maximize the number of positive examples found. While centrality based methods have a long history in term set expansion (Sarmento et al., 2007; Pantel et al., 2009), we compare them to classification methods based on the Simple Margin method, an Active Learning approach to classification using Support Vector Machines (Tong and Koller, 2002). Examining the performance of various centrality and classification based methods for a variety of distributional models over five different term sets, we can show that active learning based methods consistently outperform centrality based methods.
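
The centrality-based baseline described above can be sketched as ranking unlabeled terms by cosine similarity to the centroid of the labeled seed set. This is one simple centrality criterion, shown for illustration; the paper's active-learning classifiers are not shown here:

```python
import numpy as np

def expand(embeddings, vocab, seeds, k=1):
    """Rank non-seed terms by cosine similarity to the centroid
    of the seed terms' vectors (a simple centrality criterion)."""
    U = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = U[[vocab.index(s) for s in seeds]].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = U @ centroid
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] not in seeds][:k]

# Toy vocabulary: the color words cluster together in this 2-d space.
vocab = ["red", "green", "blue", "seven"]
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])
print(expand(E, vocab, seeds=["red", "green"], k=1))
```

In the interactive setting the user would label the returned candidates and the centroid (or, in the paper's better-performing variant, an SVM decision boundary) would be updated before the next round.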

Keywords
Active Learning, Distributional Semantics, Lexicon Acquisition, Term Set Expansion, Word Embeddings, Artificial intelligence, Semantics, Embeddings, Set expansions, Iterative methods, Natural Sciences, Naturvetenskap
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-322156 (URN), 2-s2.0-85059894892 (Scopus ID), 9791095546009 (ISBN)
Conference
LREC 2018 - 11th International Conference on Language Resources and Evaluation, 2018
Note

QC 20221202

Available from: 2022-12-02 Created: 2022-12-02 Last updated: 2024-03-18. Bibliographically approved
Gyllensten, A. C. & Sahlgren, M. (2018). Measuring Issue Ownership using Word Embeddings.
Measuring Issue Ownership using Word Embeddings
2018 (English) Other (Other academic)
Abstract [en]

Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as, "what is being talked about, regarding X", and "what do people feel, regarding X". In this paper, we investigate another avenue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to explain voter choice and electoral outcomes. We argue that issue alignment and agenda setting can be seen as a kind of semantic source similarity of the kind "how similar is source A to issue owner P, when talking about issue X", and as such can be measured using word/document embedding techniques. We present work in progress towards measuring that kind of conditioned similarity, and introduce a new notion of similarity for predictive embeddings. We then test this method by measuring the similarity between politically aligned media and political parties, conditioned on bloc-specific issues.
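
The conditioned similarity question ("how similar is source A to issue owner P, when talking about issue X") can be illustrated by comparing issue-specific centroids of document vectors per source. This is a toy sketch of the general idea only, not the paper's new notion of similarity for predictive embeddings:

```python
import numpy as np

def issue_vector(doc_vectors):
    """Centroid of a source's document vectors about one issue, normalized."""
    v = np.asarray(doc_vectors, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)

def conditioned_similarity(source_docs, owner_docs):
    """Cosine similarity between a media source and an issue owner,
    conditioned on an issue by using only each side's documents on it."""
    return float(issue_vector(source_docs) @ issue_vector(owner_docs))

# Toy document vectors about a single issue, one set per source.
media = [[0.9, 0.1], [0.8, 0.2]]
party = [[1.0, 0.0], [0.9, 0.1]]
print(round(conditioned_similarity(media, party), 3))
```

Repeating the comparison across issues gives a per-issue alignment profile, which is the shape of measurement the abstract describes.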

Keywords
Natural Sciences, Naturvetenskap
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-322155 (URN)
Note

QC 20221202

Available from: 2022-12-02 Created: 2022-12-02 Last updated: 2024-03-18. Bibliographically approved
Cuba Gyllensten, A. & Sahlgren, M. (2015). Navigating the Semantic Horizon using Relative Neighborhood Graphs. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL).
Navigating the Semantic Horizon using Relative Neighborhood Graphs
2015 (English) In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), 2015. Conference paper, Published paper (Refereed)
Abstract [en]

This paper introduces a novel way to navigate neighborhoods in distributional semantic models. The approach is based on relative neighborhood graphs, which uncover the topological structure of local neighborhoods in semantic space. This has the potential to overcome both the problem with selecting a proper k in k-NN search, and the problem that a ranked list of neighbors may conflate several different senses. We provide both qualitative and quantitative results that support the viability of the proposed method.
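
A relative neighborhood graph connects two points exactly when no third point is closer to both of them than they are to each other. A minimal sketch of that criterion on Euclidean points (illustrative; the paper applies it to local neighborhoods in semantic space):

```python
import numpy as np
from itertools import combinations

def relative_neighborhood_graph(points):
    """Edge (i, j) iff no third point r satisfies
    max(d(i, r), d(j, r)) < d(i, j) — the classic RNG criterion."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    edges = []
    for i, j in combinations(range(n), 2):
        blocked = any(max(d[i, r], d[j, r]) < d[i, j]
                      for r in range(n) if r not in (i, j))
        if not blocked:
            edges.append((i, j))
    return edges

# Three collinear points: the outer pair is "blocked" by the middle point,
# so only the two short edges survive.
P = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(relative_neighborhood_graph(P))
```

Unlike a k-NN list, the graph needs no choice of k, and its edge structure around a query word can separate neighbors that a flat ranked list would conflate.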

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2015
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322091 (URN), 10.18653/v1/d15-1292 (DOI)
Note

QC 20221202

Available from: 2022-12-01 Created: 2022-12-01 Last updated: 2025-02-07. Bibliographically approved