GDTM: Graph-based Dynamic Topic Models
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS. ORCID iD: 0000-0003-1007-8533
Research Institutes of Sweden (RISE).
(English) Article in journal (Refereed), Submitted
Abstract [en]

Dynamic Topic Modeling (DTM) addresses the extraction of topics from the short texts generated in Online Social Networks (OSNs) such as Twitter. A DTM solution must be scalable and must account for both the sparsity of short texts and the dynamicity of topics. Current solutions combine probabilistic mixture models, such as the Dirichlet Multinomial or the Pitman-Yor process, with approximate inference approaches, such as Gibbs sampling and Stochastic Variational Inference, to account for dynamicity and scalability, respectively. However, these solutions rely on weak probabilistic language models that do not account for the sparsity of short texts, and their inference is based on iterative optimization algorithms that scale poorly in the DTM setting. We present GDTM, a single-pass graph-based DTM algorithm, to solve the problem. GDTM combines a context-rich, incremental feature representation model called Random Indexing (RI) with a novel online graph partitioning algorithm to address scalability and dynamicity. In addition, GDTM uses a rich language modeling approach based on the skip-gram technique to account for sparsity. We run multiple experiments over a large-scale Twitter dataset to analyze the accuracy and scalability of GDTM and compare the results with four state-of-the-art approaches. The results show that GDTM outperforms the best of these approaches by 11% in accuracy, runs an order of magnitude faster, and achieves four times better topic quality on standard evaluation metrics.
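
The abstract names Random Indexing (RI) as the incremental feature representation model. As background, here is a minimal Python sketch of the generic RI scheme (in the Kanerva/Sahlgren style): each word gets a fixed sparse ternary index vector, and context vectors are accumulated in a single pass over a sliding window. The parameters (DIM, NNZ, WINDOW) and all names are illustrative assumptions; this is not the GDTM implementation.

    import numpy as np
    from collections import defaultdict

    DIM = 2000     # dimensionality of the reduced space (assumed)
    NNZ = 10       # non-zero entries per index vector (assumed)
    WINDOW = 2     # context window radius (assumed)

    rng = np.random.default_rng(42)

    def index_vector():
        """Fixed sparse ternary random vector: NNZ entries of +1/-1, rest 0."""
        v = np.zeros(DIM)
        pos = rng.choice(DIM, size=NNZ, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=NNZ)
        return v

    index_vectors = defaultdict(index_vector)             # one per word, never updated
    context_vectors = defaultdict(lambda: np.zeros(DIM))  # accumulated incrementally

    def update(tokens):
        """Single additive pass: fold each neighbour's index vector into the word."""
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    context_vectors[word] += index_vectors[tokens[j]]

    update("graph based dynamic topic models for short texts".split())

Because updates are purely additive, new words and new documents can be folded in without retraining, which is what makes an RI-style representation attractive in a dynamic, single-pass setting.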

Keywords [en]
Topic Modeling, Dimensionality Reduction, Distributional Semantics, Language Modeling, Graph Partitioning
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-263828
OAI: oai:DiVA.org:kth-263828
DiVA, id: diva2:1370380
Note

QC 20191115

Available from: 2019-11-15. Created: 2019-11-15. Last updated: 2019-12-05. Bibliographically approved.
In thesis
1. Graph Algorithms for Large-Scale and Dynamic Natural Language Processing
2019 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In Natural Language Processing, researchers design and develop algorithms that enable machines to understand and analyze human language. These algorithms benefit multiple downstream applications, including sentiment analysis, automatic translation, automatic question answering, and text summarization. Topic modeling is one such algorithm: it categorizes documents into multiple groups with the goal of maximizing intra-group document similarity. However, the rise of short texts, such as tweets, snippets, comments, and forum posts, as the dominant form of text in our daily interactions and communications, and as a main medium for news reporting and dissemination, increases the complexity of the problem along three dimensions: scalability, sparsity, and dynamicity. Scalability refers to the volume of messages being generated, sparsity to the shortness of the messages, and dynamicity to the rate of change in the content and topical structure of the messages (e.g., the emergence of new phrases).

We improve the scalability and accuracy of Natural Language Processing algorithms from three perspectives, leveraging innovative graph modeling and graph partitioning algorithms, incremental dimensionality reduction techniques, and rich language modeling methods. We begin by presenting a solution for multiple disambiguation of short messages, as opposed to traditional single disambiguation. The solution represents topical structures as dense partitions in a simple graph model and performs disambiguation by extracting those partitions with an innovative distributed graph partitioning algorithm. Next, we develop a scalable topic modeling algorithm that combines a novel dense graph representation with an efficient graph partitioning algorithm. Then, we analyze the effect of the temporal dimension to understand dynamicity in online social networks, and present a solution for geo-localization of Twitter users using a hierarchical model that combines partitioning of the underlying social network graph with temporal categorization of the tweets. The results show the effect of temporal dynamicity on users' spatial behavior.

This result leads to the design and development of a dynamic topic modeling solution involving an online graph partitioning algorithm and a significantly stronger language modeling approach based on the skip-gram technique. The algorithm shows strong improvements in scalability and accuracy compared to state-of-the-art models. Finally, we describe a dynamic graph-based representation learning algorithm that modifies the partitioning algorithm into a generalization of our previous work: a strong representation learning algorithm that extracts high-quality distributed and continuous representations from any sequential data with local and hierarchical structural properties similar to natural language text.
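
The thesis frames topic extraction as finding dense partitions in a graph. As a rough illustration of that framing only, the Python sketch below builds a word co-occurrence graph from toy documents and groups words with plain synchronous label propagation. Label propagation is a generic stand-in, not the distributed partitioning algorithm developed in the thesis, and every name in the sketch is hypothetical.

    from collections import defaultdict
    from itertools import combinations

    def cooccurrence_graph(docs):
        """Undirected word graph; edge weight = number of co-occurring documents."""
        graph = defaultdict(lambda: defaultdict(int))
        for doc in docs:
            for u, v in combinations(sorted(set(doc.split())), 2):
                graph[u][v] += 1
                graph[v][u] += 1
        return graph

    def label_propagation(graph, iters=10):
        """Each node adopts the label with the highest total edge weight among
        its neighbours; dense regions converge to a shared label (a 'topic')."""
        labels = {node: node for node in graph}
        for _ in range(iters):
            for node, nbrs in graph.items():
                scores = defaultdict(int)
                for nbr, weight in nbrs.items():
                    scores[labels[nbr]] += weight
                labels[node] = max(scores, key=scores.get)
        return labels

    docs = ["graph partition topic", "topic model graph",
            "soccer match goal", "goal soccer league"]
    print(label_propagation(cooccurrence_graph(docs)))

On the toy corpus the two topical clusters ("graph/topic" words versus "soccer" words) end up with distinct labels, which is the intuition behind treating dense graph partitions as topics.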

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2019. p. 42
Series
TRITA-EECS-AVL ; 2019:85
Keywords
Natural Language Processing; Lexical Disambiguation; Topic Modeling; Representation Learning; Graph Partitioning; Distributed Algorithms; Dimensionality Reduction; Random Indexing
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-263914
ISBN: 978-91-7873-377-4
Public defence
2019-12-17, Sal C, Electrum, Kistagången 16, Kista, 10:00 (English)
Note

QC 20191125

Available from: 2019-11-25. Created: 2019-11-19. Last updated: 2019-11-25. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Ghoorchian, Kambiz
