Part of Speech Tagging for Text Clustering in Swedish
2009 (English). In: Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009, 2009. Conference paper (Refereed)
Text clustering could be very useful both as an intermediate step in a large natural language processing system and as a tool in its own right. The result of a clustering algorithm is dependent on the text representation that is used. Swedish has a fairly rich morphology and a large number of homographs. This possibly leads to problems in Information Retrieval in general. We investigate the impact on text clustering of adding the part-of-speech tag to all words in the common term-by-document matrix. The experiments are carried out on a few different text sets. None of them give any evidence that part-of-speech tags improve results. However, representing texts using only nouns and proper names gives a smaller representation without worsening results. In a few experiments this smaller representation gives better results. We also investigate the effect of lemmatization and the use of a stoplist, both of which improve results significantly in some cases.
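
The three text representations compared in the abstract (plain words, words with a part-of-speech tag attached, and nouns and proper names only) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentences, the tag labels (NN, PM, VB, DT, PN) and all function names are assumptions made for illustration, and the tags are assigned by hand rather than produced by a tagger. The Swedish homograph "får" (verb "gets", noun "sheep") shows how tagging splits one term into two rows.

```python
from collections import Counter

# Hand-tagged toy documents as (word, tag) pairs. The tags are illustrative
# assumptions, not the output of a real part-of-speech tagger.
DOCS = [
    [("hon", "PN"), ("får", "VB"), ("en", "DT"), ("bok", "NN")],
    [("ett", "DT"), ("får", "NN"), ("betar", "VB")],
]


def plain_terms(doc):
    """Baseline representation: the word forms only."""
    return [word for word, _tag in doc]


def tagged_terms(doc):
    """Attach the tag to every word, so homographs become distinct terms."""
    return [f"{word}_{tag}" for word, tag in doc]


def noun_terms(doc, keep=("NN", "PM")):
    """Keep only words tagged as nouns or proper names (hypothetical labels)."""
    return [word for word, tag in doc if tag in keep]


def term_document_matrix(docs, to_terms):
    """Return (vocabulary, matrix); one row per term, one column per document."""
    counts = [Counter(to_terms(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


if __name__ == "__main__":
    for name, fn in [("plain", plain_terms),
                     ("tagged", tagged_terms),
                     ("nouns", noun_terms)]:
        vocab, matrix = term_document_matrix(DOCS, fn)
        print(name, len(vocab), "terms")
        for term, row in zip(vocab, matrix):
            print("  ", term, row)
```

In this toy case the tagged representation has more rows than the plain one, because "får" is split by tag, while the noun-only representation has far fewer. Whether either change helps clustering is the empirical question the paper addresses.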
Computer and Information Science
Identifiers: URN: urn:nbn:se:kth:diva-10124; OAI: oai:DiVA.org:kth-10124; DiVA: diva2:209212
QC 20100806. 2009-03-24, 2009-03-24, 2010-08-06. Bibliographically approved