Global Evaluation of Random Indexing through Swedish Word Clustering Compared to the People’s Dictionary of Synonyms
KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
KTH, School of Information and Communication Technology (ICT), Computer and Systems Sciences, DSV.
KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA. ORCID iD: 0000-0003-3199-8953
2009 (English). In: Proceedings of the International Conference RANLP-2009, 2009, pp. 376-380. Conference paper (Refereed)
Abstract [en]

Evaluation of word space models is usually local in the sense that it only considers words that are deemed very similar by the model. We propose a global evaluation scheme based on clustering of the words: a clustering of high quality in an external evaluation against a semantic resource, such as a dictionary of synonyms, indicates a word space model of high quality. We use Random Indexing to create several different models and compare them by clustering evaluation against the People's Dictionary of Synonyms, a list of Swedish synonyms graded by the public. Most notably, we get better results for models based on syntagmatic information (words that appear together) than for models based on paradigmatic information (words that appear in similar contexts). This is quite contrary to previous results reported for local evaluation. Clustering into ten clusters results in a recall of 83% for a syntagmatic model, compared to 34% for a comparable paradigmatic model and 10% for a random partition.
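The syntagmatic/paradigmatic distinction in the abstract can be made concrete with a minimal Random Indexing sketch. The dimensionality and sparsity values below are illustrative assumptions, not the settings used in the paper:

```python
import random
from collections import defaultdict

DIM = 512      # reduced dimensionality (illustrative; not the paper's setting)
NONZERO = 8    # non-zero entries per random index vector (also illustrative)

def index_vector(rng):
    """Sparse ternary random vector: NONZERO positions set to +1 or -1."""
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((1, -1))
    return vec

def paradigmatic_model(docs, window=2, seed=0):
    """A word's context vector sums the index vectors of its neighbours
    within a window, so words that appear in *similar contexts* end up
    with similar vectors."""
    rng = random.Random(seed)
    word_index = defaultdict(lambda: index_vector(rng))
    context = defaultdict(lambda: [0] * DIM)
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    for k, v in enumerate(word_index[doc[j]]):
                        context[w][k] += v
    return context

def syntagmatic_model(docs, seed=0):
    """A word's context vector sums the index vectors of the documents it
    occurs in, so words that *appear together* end up with similar vectors."""
    rng = random.Random(seed)
    context = defaultdict(lambda: [0] * DIM)
    for doc in docs:
        doc_vec = index_vector(rng)
        for w in set(doc):
            for k, v in enumerate(doc_vec):
                context[w][k] += v
    return context
```

In the syntagmatic model, two words that co-occur in a single document receive identical contributions from that document's index vector, which is exactly why such a model groups co-occurring words into the same clusters.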

Place, publisher, year, edition, pages
2009. pp. 376-380.
Series: International Conference Recent Advances in Natural Language Processing, RANLP, ISSN 1313-8502
Keyword [en]
Random Indexing, Word Space Model, Word Clustering, Evaluation, Dictionary of Synonyms
National Category
Computer and Information Science; Language Technology (Computational Linguistics)
URN: urn:nbn:se:kth:diva-10125, ScopusID: 2-s2.0-84866846352, OAI: diva2:209222
International Conference on Recent Advances in Natural Language Processing, RANLP-2009; Borovets; Bulgaria; 14 September 2009 through 16 September 2009

QC 20100806

Available from: 2009-03-24. Created: 2009-03-24. Last updated: 2014-09-24. Bibliographically approved.
In thesis
1. Text Clustering Exploration: Swedish Text Representation and Clustering Results Unraveled
2009 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Text clustering divides a set of texts into clusters (parts), so that texts within each cluster are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on familiar ones. The main contributions of this thesis are an investigation of text representation for Swedish and some extensions of the work on how to use text clustering as an exploration tool. We have also done some work on synonyms and evaluation of clustering results.

Text clustering, at least such as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words.

Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Swedish has more morphological variation than for instance English. We show that it is beneficial to use the lemma form of words rather than the word forms. Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. Our experiments show that it is beneficial to split solid compounds into their parts when building the representation.

The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. We have also tried to differentiate between homographs, words that look alike but mean different things, by augmenting all words with a tag indicating their part of speech. None of our experiments using phrases or part of speech information have shown any improvement over using the ordinary model.

Evaluation of text clustering results is very hard. What is a good partition of a text set is inherently subjective.
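The preprocessing the thesis argues for, lemmatizing word forms and splitting solid compounds before building the vector space representation, can be sketched as follows. The lemma and compound tables here are toy entries for illustration; a real system would use a Swedish morphological analyser:

```python
import math
from collections import Counter

# Toy tables with a few illustrative Swedish entries; in practice these
# mappings come from a morphological analyser, not a hand-written dict.
LEMMA = {"bilarna": "bil", "bilar": "bil", "fabrikerna": "fabrik"}
COMPOUNDS = {"bilfabrik": ["bil", "fabrik"]}

def tokens(text):
    """Lowercase, lemmatize, and split solid compounds into their parts."""
    out = []
    for w in text.lower().split():
        w = LEMMA.get(w, w)
        out.extend(COMPOUNDS.get(w, [w]))
    return out

def vector(text):
    """Bag-of-words vector: texts are similar if they share many words."""
    return Counter(tokens(text))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# With compound splitting, "bilfabrik" (car factory) shares terms with a
# text mentioning "bilarna" (the cars) and "fabrikerna" (the factories);
# without it, the two texts would share no terms at all.
sim = cosine(vector("bilfabrik"), vector("bilarna fabrikerna"))
```

This illustrates the thesis's point: solid compounds often correspond to phrases in other languages, so leaving them unsplit hides real content overlap between texts.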
External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is: text sets differ in difficulty to cluster, and categorizations are more or less adapted to a particular text set. We describe how evaluation can be improved for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition.

In some related work we have built a dictionary of synonyms. We use it to compare two different principles for automatic word relation extraction through clustering of words.

Text clustering can be used to explore the contents of a text set. We have developed a visualization method that aids such exploration, and implemented it in a tool called Infomat. It presents the representation matrix directly in two dimensions. When the order of texts and words is changed, for instance by clustering, distributional patterns that indicate similarities between texts and words appear.

We have used Infomat to explore a set of free text answers about occupation from a questionnaire given to over 40 000 Swedish twins. The questionnaire also contained a closed answer regarding smoking. We compared several clusterings of the text answers to the closed answer, regarded as a categorization, by means of clustering evaluation. A recurring text cluster of high quality led us to formulate the hypothesis that "farmers smoke less than the average", which we later could verify by reading previous studies. This hypothesis generation method could be used on any set of texts that is coupled with data that is restricted to a limited number of possible values.
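One way to make the external evaluation against a synonym dictionary concrete is to measure the fraction of synonym pairs whose two members land in the same cluster. This is a plausible reading of the recall figures quoted in the paper abstract above, though the exact definition used there may differ:

```python
def synonym_pair_recall(clusters, synonym_pairs):
    """Fraction of synonym pairs whose two members fall in the same cluster.

    `clusters` is a list of sets of words; pairs with a member that
    appears in no cluster are ignored.
    """
    cluster_of = {w: i for i, members in enumerate(clusters) for w in members}
    covered = [(a, b) for a, b in synonym_pairs
               if a in cluster_of and b in cluster_of]
    if not covered:
        return 0.0
    hits = sum(1 for a, b in covered if cluster_of[a] == cluster_of[b])
    return hits / len(covered)

# Toy example: two of the three synonym pairs share a cluster.
clusters = [{"glad", "lycklig"}, {"bil", "fordon"}, {"hus"}]
pairs = [("glad", "lycklig"), ("bil", "fordon"), ("bil", "hus")]
recall = synonym_pair_recall(clusters, pairs)  # 2/3
```

Note that this kind of measure is trivially maximized by putting every word in one cluster, which is why it is reported for a fixed number of clusters (ten, in the paper above) and compared against a random partition baseline.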

Place, publisher, year, edition, pages
Stockholm: KTH, 2009. vii, 71 p.
Series: Trita-CSC-A, ISSN 1653-5723; 2009:4
National Category
Computer and Information Science
urn:nbn:se:kth:diva-10129 (URN), 978-91-7415-251-7 (ISBN)
Public defence
2009-04-06, Sal F3, KTH, Lindstedtsvägen 26, Stockholm, 13:15 (English)
QC 20100806. Available from: 2009-03-24. Created: 2009-03-24. Last updated: 2010-08-06. Bibliographically approved.

Open Access in DiVA

No full text

Other links

Scopus: Published version

Search in DiVA

By author/editor
Rosell, Magnus; Hassel, Martin; Kann, Viggo
By organisation
Numerical Analysis and Computer Science, NADA; Computer and Systems Sciences, DSV
Computer and Information Science; Language Technology (Computational Linguistics)
