Comparison of Automatic Classifiers’ Performances using Word-based Feature Extraction Techniques in an E-government setting
KTH, School of Information and Communication Technology (ICT).
2011 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Citizens now commonly use email to communicate with their government. Many of the emails governments receive concern common queries and subjects that handling officers have to answer manually. Automatic classification of incoming emails can increase communication efficiency by reducing the delay between a query and its response.

This thesis is part of the IMAIL project, which aims to provide an automatic answering solution for the Swedish Social Insurance Agency (SSIA; “Försäkringskassan” in Swedish). The goal of this thesis is to analyze and compare the classification performance of different sets of features extracted from SSIA emails across different automatic classifiers. The features extracted from the emails also depend on the preprocessing applied beforehand. Compound splitting, lemmatization, stop-word removal, Part-of-Speech tagging and n-grams are the processes applied to the data set. Classification is performed using Support Vector Machines, k-Nearest Neighbors and Naive Bayes. Precision, recall and F-measure are used to analyze and compare the results.
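As an illustration of word-based feature extraction, the following is a minimal sketch of stop-word removal followed by word n-gram counting. It is not the thesis's pipeline: the function name `word_ngrams`, the tiny English stop-word list and the sample sentence are assumptions made for this example (the SSIA emails are in Swedish), and compound splitting, lemmatization and PoS tagging are left out.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only (not the list used in the thesis).
STOP_WORDS = {"the", "a", "is", "my", "to", "of"}

def word_ngrams(text, n=2, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, drop stop words, and count word n-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    # zip over n shifted copies of the token list yields consecutive n-token windows
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Invented sample text; the resulting counts could serve as a feature vector.
features = word_ngrams("When is my parental benefit paid", n=2)
```

Each distinct n-gram becomes one feature, so the choice of preprocessing steps directly changes the feature set a classifier sees.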

In the results obtained in this thesis, SVM provides the best classification, with an F-measure of 0.787. However, Naive Bayes classifies most of the email categories better than SVM does. Thus, it cannot be concluded whether SVM classifies better than Naive Bayes or not.
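Per-category precision, recall and F-measure, the kind of scores behind such a comparison, can be sketched as follows. This is a sketch under stated assumptions: the F-measure is taken to be the balanced F1 score, each email has exactly one category, and the function name `per_category_scores` and the toy labels are invented for illustration and are not thesis data.

```python
def per_category_scores(gold, predicted):
    """Return {category: (precision, recall, f_measure)} for single-label predictions."""
    scores = {}
    for cat in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == cat and p == cat for g, p in pairs)  # true positives
        fp = sum(g != cat and p == cat for g, p in pairs)  # false positives
        fn = sum(g == cat and p != cat for g, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Balanced F-measure (F1): harmonic mean of precision and recall (assumed here).
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[cat] = (precision, recall, f)
    return scores

# Toy labels, invented for illustration only.
gold = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
scores = per_category_scores(gold, predicted)
```

Because scores are computed per category, one classifier can have the highest overall F-measure while another wins on more individual categories.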

Furthermore, a comparison with Dalianis et al. (2011) is made. The results obtained with this approach outperform the earlier ones. SVM achieved an F-measure of 0.858 when using PoS-tagging on the original emails, which improves on the 0.83 reported in Dalianis et al. (2011) by almost 3 percentage points. In this case, SVM was clearly better than Naive Bayes.

Place, publisher, year, edition, pages
2011. 64 p.
Trita-ICT-EX, 25
Keyword [en]
E-government, machine learning, WEKA, SVM, Naive Bayes, kNN, Swedish, PoS-tagging, feature extraction, feature selection, automatic e-mail classification
National Category
Computer and Information Science
URN: urn:nbn:se:kth:diva-32363
OAI: diva2:410293
Available from: 2011-04-13 Created: 2011-04-13 Last updated: 2011-05-11
Bibliographically approved
