Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
KTH, School of Electrical Engineering and Computer Science (EECS), Software and Computer systems, SCS. ORCID iD: 0000-0001-8457-4105
2018 (English). In: Journal of Computational and Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no. 2, p. 449-463. Article in journal (Refereed). Published.
Abstract [en]

Topic models, and more specifically the class of latent Dirichlet allocation (LDA), are widely used for probabilistic modeling of text. Markov chain Monte Carlo (MCMC) sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler. Supplementary materials for this article are available online.
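
The abstract describes the sampler only at a high level. As a concrete illustration, below is a minimal Python sketch of a partially collapsed Gibbs sampler for LDA in the spirit of the paper: the document-topic proportions theta are integrated out while the topic-word matrix Phi is sampled explicitly, so that, given Phi, the topic indicators are conditionally independent across documents. This is not the paper's implementation (which additionally exploits sparsity and true multi-core parallelism); all names and hyperparameter values are illustrative assumptions.

    import numpy as np

    def partially_collapsed_gibbs(docs, V, K, alpha=0.1, beta=0.01,
                                  n_iter=100, seed=0):
        """Sketch of a partially collapsed Gibbs sampler for LDA.

        docs: list of documents, each a list of word ids in [0, V).
        theta (doc-topic proportions) is collapsed; Phi (topic-word
        matrix) is kept instantiated, so the z-step factorizes over
        documents and could run in parallel.
        """
        rng = np.random.default_rng(seed)
        D = len(docs)
        z = [rng.integers(K, size=len(doc)) for doc in docs]
        n_dk = np.zeros((D, K))  # doc-topic counts (theta collapsed)
        n_kv = np.zeros((K, V))  # topic-word counts (stats for Phi)
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                n_dk[d, z[d][i]] += 1
                n_kv[z[d][i], v] += 1
        for _ in range(n_iter):
            # Phi-step: conjugate Dirichlet draw per topic.
            phi = np.vstack([rng.dirichlet(beta + n_kv[k])
                             for k in range(K)])
            # z-step: documents are independent given Phi; in a real
            # implementation this outer loop is distributed over cores.
            for d, doc in enumerate(docs):
                for i, v in enumerate(doc):
                    k_old = z[d][i]
                    n_dk[d, k_old] -= 1
                    n_kv[k_old, v] -= 1
                    # p(z = k | rest) is proportional to
                    # phi[k, v] * (n_dk[d, k] + alpha)
                    p = phi[:, v] * (n_dk[d] + alpha)
                    k_new = rng.choice(K, p=p / p.sum())
                    z[d][i] = k_new
                    n_dk[d, k_new] += 1
                    n_kv[k_new, v] += 1
        return phi, n_dk

    # Toy usage: two short documents over a 4-word vocabulary.
    phi, n_dk = partially_collapsed_gibbs([[0, 1, 2, 1], [2, 3, 3, 0]],
                                          V=4, K=2, n_iter=50)

The conditional independence of documents given Phi is what allows the z-step to be distributed without approximation, which is why the sampler remains exact, unlike approximate parallelizations of the fully collapsed sampler.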

Place, publisher, year, edition, pages
American Statistical Association, 2018. Vol. 27, no. 2, p. 449-463.
Keywords [en]
Bayesian inference, Computational complexity, Gibbs sampling, Latent Dirichlet allocation, Massive datasets, Parallel computing
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:kth:diva-238254
DOI: 10.1080/10618600.2017.1366913
ISI: 000435688200018
Scopus ID: 2-s2.0-85046690915
OAI: oai:DiVA.org:kth-238254
DiVA id: diva2:1259974
Funder
Swedish Foundation for Strategic Research
Note

QC 20181031

Available from: 2018-10-31. Created: 2018-10-31. Last updated: 2018-10-31. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Broman, David
