Quantifying Meaning
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-2236-4978
2023 (English). Doctoral thesis, comprehensive summary (Other academic)

Abstract [en]

Distributional semantic models are a class of machine learning models with the aim of constructing representations that capture the semantics, i.e. meaning, of objects that carry meaning in a data-driven fashion. This thesis is particularly concerned with the construction of semantic representations of words, an endeavour that has a long history in computational linguistics, and that has seen dramatic developments in recent years.

The primary research objective of this thesis is to explore the limits and applications of distributional semantic models of words, i.e. word embeddings. In particular, it explores the relation between model and embedding semantics, i.e. how model design influences what our embeddings encode, how to reason about embeddings, and how properties of the model can be exploited to extract novel information from embeddings. Concretely, we introduce topologically aware neighborhood queries that enrich the information gained from neighborhood queries on distributional semantic models, conditioned similarity queries (and models enabling them), concept extraction from distributional semantic models, applications of embedding models in the realm of political science, as well as a thorough evaluation of a broad range of distributional semantic models. 

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2023, p. 45
Series
TRITA-EECS-AVL ; 2023:2
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-322262. ISBN: 978-91-8040-444-0 (print). OAI: oai:DiVA.org:kth-322262. DiVA, id: diva2:1717044
Public defence
2023-01-17, Zoom: https://kth-se.zoom.us/j/66943302856, F3, Lindstedtsvägen 26, Stockholm, 09:00 (English)
Note

QC 20221207

Available from: 2022-12-08. Created: 2022-12-07. Last updated: 2025-02-07. Bibliographically approved
List of papers
1. Navigating the Semantic Horizon using Relative Neighborhood Graphs
2015 (English). In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), 2015. Conference paper, Published paper (Refereed)
Abstract [en]

This paper introduces a novel way to navigate neighborhoods in distributional semantic models. The approach is based on relative neighborhood graphs, which uncover the topological structure of local neighborhoods in semantic space. This has the potential to overcome both the problem with selecting a proper k in k-NN search, and the problem that a ranked list of neighbors may conflate several different senses. We provide both qualitative and quantitative results that support the viability of the proposed method.
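The relative neighborhood graph underlying the method can be computed directly from pairwise distances: an edge between two points survives only if no third point lies in the "lune" between them. The sketch below is illustrative code, not the paper's implementation; it uses a brute-force O(n³) check, which is fine for the small local neighborhoods the paper considers.

```python
import numpy as np

def relative_neighborhood_graph(points):
    """Build the relative neighborhood graph (RNG) of a point set.

    An edge (a, b) is kept iff there is no third point c that is
    strictly closer to both a and b than they are to each other,
    i.e. no c with max(d(a, c), d(b, c)) < d(a, b).
    """
    n = len(points)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            blocked = any(
                max(d[a, c], d[b, c]) < d[a, b]
                for c in range(n) if c not in (a, b)
            )
            if not blocked:
                edges.add((a, b))
    return edges
```

For three collinear points 0, 1, 2 the middle point blocks the long edge, so only the two short edges remain; applied to the k nearest neighbors of a query word, this is what separates the topologically distinct directions that a flat ranked list conflates.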

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2015
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322091 (URN). 10.18653/v1/d15-1292 (DOI)
Note

QC 20221202

Available from: 2022-12-01. Created: 2022-12-01. Last updated: 2025-02-07. Bibliographically approved
2. Distributional term set expansion
2018 (English). Conference paper, Published paper (Other academic)
Abstract [en]

This paper is a short empirical study of the performance of centrality- and classification-based iterative term set expansion methods for distributional semantic models. Iterative term set expansion is an interactive process using distributional semantic models where a user labels terms as belonging to some sought-after term set, and a system uses this labeling to supply the user with new candidate terms to label, trying to maximize the number of positive examples found. While centrality-based methods have a long history in term set expansion (Sarmento et al., 2007; Pantel et al., 2009), we compare them to classification methods based on the Simple Margin method, an Active Learning approach to classification using Support Vector Machines (Tong and Koller, 2002). Examining the performance of various centrality- and classification-based methods for a variety of distributional models over five different term sets, we show that active learning based methods consistently outperform centrality-based methods.
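The centrality-based variant of the interactive loop described above can be sketched as follows. The function and variable names are illustrative, not taken from the paper; the classification-based (Simple Margin) variant would replace the centroid ranking with an SVM-based active learner over the labeled examples.

```python
import numpy as np

def expand_term_set(embeddings, vocab, seeds, oracle, rounds=5, k=3):
    """Centroid-based iterative term set expansion (a sketch).

    Each round: rank unlabeled terms by cosine similarity to the
    centroid of the current positive set, show the top-k candidates
    to the oracle (the user), and fold the labels back in.
    """
    # Unit-normalize rows so dot products are cosine similarities.
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(vocab)}
    positives = set(seeds)
    labeled = set(seeds)
    for _ in range(rounds):
        centroid = E[[idx[w] for w in positives]].mean(axis=0)
        scores = E @ centroid
        candidates = sorted(
            (w for w in vocab if w not in labeled),
            key=lambda w: -scores[idx[w]],
        )[:k]
        if not candidates:
            break
        for w in candidates:
            labeled.add(w)
            if oracle(w):  # user says the term belongs to the set
                positives.add(w)
    return positives
```

In an interactive setting the `oracle` callback would be the human labeler; in the paper's evaluation it plays the role of the gold-standard term set.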

Keywords
Active Learning, Distributional Semantics, Lexicon Acquisition, Term Set Expansion, Word Embeddings, Artificial intelligence, Semantics, Embeddings, Set expansions, Iterative methods, Natural Sciences, Naturvetenskap
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-322156 (URN). 2-s2.0-85059894892 (Scopus ID). 9791095546009 (ISBN)
Conference
LREC 2018 - 11th International Conference on Language Resources and Evaluation, 2018
Note

QC 20221202

Available from: 2022-12-02. Created: 2022-12-02. Last updated: 2024-03-18. Bibliographically approved
3. Shallow Contextualized Word Embeddings
(English). Manuscript (preprint) (Other academic)
Abstract [en]

This paper introduces a novel word embedding method that is able to learn contextualized representations using a shallow model based on factorization machines. We discuss the limits of log-linear models and demonstrate how our proposed model -- Continuous Bag of Pairs (CBoP) -- can overcome these limits. We also demonstrate contextualized word similarity queries with the CBoP model, i.e. queries of the kind "What words are similar to orange, given a context word juice?" We validate our model using standard word-based and sentence-based similarity benchmarks and observe that there is little difference between CBoP and a comparable CBoW model on word-based benchmarks, that CBoP outperforms CBoW on Semantic Textual Similarity benchmarks, yet is worse than CBoW on sentence classification tasks.
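A conditioned query of the kind "similar to orange, given juice" can be approximated even with ordinary static vectors by ranking against a representation of the pair. The sketch below is a stand-in for intuition only: it sums two static word vectors, whereas the CBoP model itself learns pairwise feature interactions via factorization machines.

```python
import numpy as np

def conditioned_neighbors(embeddings, vocab, target, context, k=3):
    """Approximate a conditioned similarity query with static vectors.

    Stand-in sketch: represent the (target, context) pair as the
    normalized sum of the two word vectors, then rank all other
    words by cosine similarity to that pair vector.
    """
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(vocab)}
    pair = E[idx[target]] + E[idx[context]]
    pair /= np.linalg.norm(pair)
    scores = E @ pair
    ranked = sorted(
        (w for w in vocab if w not in (target, context)),
        key=lambda w: -scores[idx[w]],
    )
    return ranked[:k]
```

With toy vectors where "orange" sits between a fruit axis and a color axis, conditioning on "juice" pulls the neighborhood toward the fruit sense.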

National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322090 (URN)
Note

QC 20221202

Available from: 2022-12-01. Created: 2022-12-01. Last updated: 2025-02-07. Bibliographically approved
4. Measuring Issue Ownership using Word Embeddings
2018 (English). Other (Other academic)
Abstract [en]

Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as, "what is being talked about, regarding X", and "what do people feel, regarding X". In this paper, we investigate another avenue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to explain voter choice and electoral outcomes. We argue that issue alignment and agenda setting can be seen as a kind of semantic source similarity of the kind "how similar is source A to issue owner P, when talking about issue X", and as such can be measured using word/document embedding techniques. We present work in progress towards measuring that kind of conditioned similarity, and introduce a new notion of similarity for predictive embeddings. We then test this method by measuring the similarity between politically aligned media and political parties, conditioned on bloc-specific issues.

Keywords
Natural Sciences, Naturvetenskap
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-322155 (URN)
Note

QC 20221202

Available from: 2022-12-02. Created: 2022-12-02. Last updated: 2024-03-18. Bibliographically approved
5. R-grams: Unsupervised Learning of Semantic Units in Natural Language
2019 (English). In: Proceedings of the 13th International Conference on Computational Semantics - Student Papers, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding-techniques. In contrast to previous work which has primarily been focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.
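The greedy pair-merging at the core of Re-Pair/BPE-style segmentation fits in a few lines. The sketch below is illustrative code, not the paper's implementation; it is demonstrated at the character level, but the same loop over word-level units is where r-grams above the word level emerge.

```python
from collections import Counter

def learn_merges(tokens, num_merges):
    """Greedy Re-Pair/BPE-style merging (a sketch).

    Repeatedly find the most frequent adjacent pair of units in the
    sequence and replace every occurrence with a single merged unit,
    stopping when no pair occurs at least twice.
    """
    seq = list(tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):  # left-to-right replacement pass
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges
```

Run on `"abababc"`, the first merge fuses the frequent pair `("a", "b")` and the second fuses `("ab", "ab")`, leaving the segments `["abab", "ab", "c"]`; the recorded merge list is what reshapes the token frequency distribution discussed in the paper.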

National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322093 (URN). 10.48550/arXiv.1808.04670 (DOI)
Conference
13th International Conference on Computational Semantics
Note

QC 20221202

Available from: 2022-12-01. Created: 2022-12-01. Last updated: 2025-02-07. Bibliographically approved
6. A Distributional Semantic Online Lexicon for Linguistic Explorations of Societies
2022 (English). In: Social science computer review, ISSN 0894-4393, E-ISSN 1552-8286. Article in journal (Refereed), Published
Abstract [en]

Linguistic Explorations of Societies (LES) is an interdisciplinary research project with scholars from the fields of political science, computer science, and computational linguistics. The overarching ambition of LES has been to contribute to the survey-based comparative scholarship by compiling and analyzing online text data within and between languages and countries. To this end, the project has developed an online semantic lexicon, which allows researchers to explore meanings and usages of words in online media across a substantial number of geo-coded languages. The lexicon covers data from approximately 140 language-country combinations and is, to our knowledge, the most extensive free research resource of its kind. Such a resource makes it possible to critically examine survey translations and identify discrepancies in order to modify and improve existing survey methodology, and its unique features further enable Internet researchers to study public debate online from a comparative perspective. In this article, we discuss the social scientific rationale for using online text data as a complement to survey data, and present the natural language processing-based methodology behind the lexicon including its underpinning theory and practical modeling. Finally, we engage in a critical reflection about the challenges of using online text data to gauge public opinion and political behavior across the world.

Place, publisher, year, edition, pages
SAGE Publications, 2022
Keywords
distributional semantics, natural language processing, word2vec, comparative surveys, language use, semantic similarities, Language Technology (Computational Linguistics), Språkteknologi (språkvetenskaplig databehandling)
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322157 (URN). 10.1177/08944393211049774 (DOI). 000787865700001. 2-s2.0-85130070813 (Scopus ID)
Note

QC 20221202

Available from: 2022-12-02. Created: 2022-12-02. Last updated: 2025-02-07. Bibliographically approved
7. A comparative evaluation and analysis of three generations of Distributional Semantic Models
2022 (English). In: Language resources and evaluation, ISSN 1574-020X, E-ISSN 1574-0218, Vol. 56, no 4, p. 1269-1313. Article in journal (Refereed), Published
Abstract [en]

Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by neural language models. Although an extensive body of research has been devoted to Distributional Semantic Model (DSM) evaluation, we still lack a thorough comparison with respect to tested models, semantic tasks, and benchmark datasets. Moreover, previous work has mostly focused on task-driven evaluation, instead of exploring the differences between the way models represent the lexical semantic space. In this paper, we perform a large-scale evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT. First of all, we investigate the performance of embeddings in several semantic tasks, carrying out an in-depth statistical analysis to identify the major factors influencing the behavior of DSMs. The results show that (i) the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous and (ii) static DSMs surpass BERT representations in most out-of-context semantic tasks and datasets. Furthermore, we borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models. RSA reveals important differences related to the frequency and part-of-speech of lexical items.
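The RSA methodology mentioned above amounts to correlating the pairwise-similarity structure of two embedding spaces over a shared vocabulary. A minimal sketch, using Pearson correlation for brevity where RSA work often prefers Spearman:

```python
import numpy as np

def rsa_score(E1, E2):
    """Representational Similarity Analysis between two spaces (a sketch).

    For each space, compute the cosine-similarity matrix over the same
    vocabulary, flatten its upper triangle into a similarity profile,
    and correlate the two profiles.
    """
    def sim_profile(E):
        En = E / np.linalg.norm(E, axis=1, keepdims=True)
        S = En @ En.T
        # Upper triangle only: each word pair counted once, diagonal dropped.
        return S[np.triu_indices(len(E), k=1)]

    a, b = sim_profile(E1), sim_profile(E2)
    return float(np.corrcoef(a, b)[0, 1])
```

Because the comparison happens in similarity space rather than vector space, the score is invariant to rotation and scaling of either embedding space, which is what makes RSA suitable for comparing models with incompatible coordinate systems.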

National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-322094 (URN). 10.1007/s10579-021-09575-z (DOI). 000762866400001. 2-s2.0-85125439429 (Scopus ID)
Note

QC 20221202

Available from: 2022-12-01. Created: 2022-12-01. Last updated: 2025-02-07. Bibliographically approved

Open Access in DiVA

Thesis (967 kB), 235 downloads
File information
File name: FULLTEXT03.pdf. File size: 967 kB. Checksum: SHA-512
72012a8b8958d5930409dcd5aea1614f37d0f131f2ef677802c12475b8023c545eb8480419c8ccf7379b3eec8ae3623b61da527d1e46f0bfc8ede4097badf5c0
Type: fulltext. Mimetype: application/pdf

Authority records

Cuba Gyllensten, Amaru

