Improving Clustering of Swedish Newspaper Articles using Stemming and Compound Splitting
2003 (English). Conference paper (Refereed)
Using properties of the Swedish language when indexing newspaper articles improves clustering results. To show this, a clustering algorithm was implemented and language-specific tools were used when building the representation of the articles. Since Swedish is an inflecting language, many words have different forms. Thus two documents compared on word occurrence (i.e. with the vector space model and cosine measure of Information Retrieval) are not necessarily judged similar even though they contain the same word(s). To overcome this we have used a stemmer. Compounds are regularly formed as one word in Swedish. Hence indexing on words leaves the information in the components of compounds unused. We use the spell checking program Stava to split compounds into their components. Newspapers sort their articles into sections such as Economy, Domestic, Sports etc. Using these sections, we calculate the entropy of the clusterings and use it as a measure of quality. We have found that stemming improves clustering results on our collections by about 4% compared to not using it. Compound splitting improves results by about 10% (by 13% in combination with stemming). Keeping the original compounds in the representation does not improve results.
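As a rough illustration of the ideas in the abstract, the following minimal Python sketch shows how splitting a compound can make two bag-of-words vectors overlap under the cosine measure, and how section labels can be turned into an entropy score for a clustering. It is not the authors' implementation: the tiny lexicon `TOY_SPLITS` is a hypothetical stand-in for a real splitter such as Stava, stemming is omitted, and the size-weighted average entropy is one common definition assumed here, since the abstract does not give the exact formula.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for a real compound splitter (assumption).
TOY_SPLITS = {"fotbollsmatch": ["fotboll", "match"]}


def index_terms(tokens, split_compounds=False):
    """Build a bag-of-words Counter, optionally replacing compounds by their parts."""
    terms = []
    for tok in tokens:
        if split_compounds and tok in TOY_SPLITS:
            terms.extend(TOY_SPLITS[tok])
        else:
            terms.append(tok)
    return Counter(terms)


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clustering_entropy(clusters):
    """Size-weighted average entropy (in bits) of section labels per cluster.

    `clusters` is a list of lists of section labels; lower values mean
    each cluster is more dominated by a single section.
    """
    total = sum(len(c) for c in clusters)
    result = 0.0
    for labels in clusters:
        n = len(labels)
        if n == 0:
            continue
        h = -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())
        result += (n / total) * h
    return result


doc1 = ["fotbollsmatch"]
doc2 = ["fotboll", "match"]

# Without splitting the two documents share no index terms.
sim_plain = cosine(index_terms(doc1), index_terms(doc2))
# With splitting the compound contributes its components to the vector.
sim_split = cosine(index_terms(doc1, True), index_terms(doc2, True))

# One pure cluster and one evenly mixed cluster.
score = clustering_entropy([["Sports", "Sports"], ["Economy", "Sports"]])
```

In this toy setting the two documents only become comparable once the compound is split, which is the effect the abstract attributes to compound splitting on real articles.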
Place, publisher, year, edition, pages
2003. 1-7 p.
Identifiers
URN: urn:nbn:se:kth:diva-7120
OAI: oai:DiVA.org:kth-7120
DiVA: diva2:12036
NoDaLiDa 2003, Reykjavik, Iceland 2003
QC 20100806
2005-09-29 2005-09-29 2010-12-20 Bibliographically approved