Comparing Text Similarity Functions For Outlier Detection: In a Dataset with Small Collections of Titles
KTH, School of Electrical Engineering and Computer Science (EECS).
KTH, School of Electrical Engineering and Computer Science (EECS).
2022 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Detecting when a title has been put in an incorrect data category can be of interest to commercial digital services such as streaming platforms, since they group movies by genre. Price comparison services, which categorise offers by their respective product, are another example of a beneficiary. Outlier detection can be applied to find data points that differ significantly from the majority (outliers). A title in the wrong category is an example of an outlier. Outlier detection algorithms may require a metric that quantifies the dissimilarity between two points, and text similarity functions can provide such a metric when comparing text data. The question therefore arises: "Which text similarity function is best suited for detecting incorrect titles in practical environments such as commercial digital services?" In this thesis, different text similarity functions are evaluated when set to detect outlying (incorrect) product titles, with both efficiency and effectiveness taken into consideration. Results show that the variance in performance between functions is generally small, with a few exceptions. The overall top performer is Sørensen-Dice, a function that divides the number of common words by the total number of words found in both strings. While the function is efficient in the sense that it identifies most outliers in a practical time frame, it is not likely to find all of them and is therefore deemed not effective enough to be applied in practical use. It might therefore be better applied as part of a larger system, or in combination with manual analysis.
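To make the word-level Sørensen-Dice description above concrete, the following is a minimal Python sketch, not the thesis's implementation. It uses the standard form of the coefficient, 2|A ∩ B| / (|A| + |B|), which includes a factor of two so that identical titles score 1. Whitespace tokenisation, lowercasing, the helper names (`tokenize`, `dice_similarity`, `outlier_scores`), the parameter `k`, and the example titles are all assumptions made for illustration. The outlier score shown (mean Dice distance to the `k` nearest other titles in the same collection) is one simple nearest-neighbour scheme and may differ from the method evaluated in the thesis.

```python
def tokenize(title):
    """Lowercase a title and split it on whitespace into a set of words (assumed tokenisation)."""
    return set(title.lower().split())


def dice_similarity(a, b):
    """Standard Sørensen-Dice coefficient over word sets: 2|A & B| / (|A| + |B|)."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a and not words_b:
        return 1.0  # convention chosen here: two empty titles count as identical
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def outlier_scores(titles, k=2):
    """Score each title by its mean Dice distance to its k nearest other titles.

    Higher scores mean the title is less similar to the rest of the collection.
    This is an illustrative nearest-neighbour scheme, not the thesis's algorithm.
    """
    scores = []
    for i, title in enumerate(titles):
        distances = sorted(
            1.0 - dice_similarity(title, other)
            for j, other in enumerate(titles)
            if j != i
        )
        nearest = distances[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


# Hypothetical collection of product titles with one misplaced entry.
titles = [
    "Apple iPhone 13 128GB Blue",
    "Apple iPhone 13 128GB Black",
    "iPhone 13 Apple 256GB",
    "Garden Hose 20m",
]
scores = outlier_scores(titles, k=2)
most_suspicious = titles[scores.index(max(scores))]
```

In this toy collection the garden hose title shares no words with the others, so it receives the highest score. In small collections the choice of `k` and of a score threshold matters a great deal, which fits the abstract's conclusion that such a function is better used alongside manual review.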

Abstract [sv]

Identifying when a title has been placed in an incorrect data category can be of interest to commercial digital services, such as film streaming platforms, since films are divided into genres. Price comparison services, which categorise offers by product, would also benefit. Outlier detection can be applied to find data points that differ significantly from the rest (outliers). A title in an incorrect category is an example of such an outlier. Outlier detection algorithms may require a measure that quantifies how different two data points are. Text similarity functions quantify the difference between text strings and can therefore be integrated into these algorithms. This raises a follow-up question: "Which text similarity function is best suited for finding deviating titles in practical environments such as commercial digital services?" In this thesis, different text similarity functions are therefore compared when used to find incorrect product titles. The comparison takes both time efficiency and correctness into account. Results show that the variation in performance between functions is generally small, with a few exceptions. The overall best-performing function is Sørensen-Dice, which divides the number of common words by the total number of words in both titles. The function is efficient in that it identifies most outliers within a practical time frame, but it will probably not find all of them. Rather than being used as a complete solution, it would therefore be advantageous to combine it with manual analysis or a more comprehensive solution.

Place, publisher, year, edition, pages
2022. p. 54
Series
TRITA-EECS-EX ; 2022:845
Keywords [en]
Outlier Detection, Text Similarity Functions, Natural Language Processing, N-Nearest Neighbour
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-322838
OAI: oai:DiVA.org:kth-322838
DiVA, id: diva2:1724526
External cooperation
Panprices
Supervisors
Examiners
Available from: 2023-01-27. Created: 2023-01-08. Last updated: 2023-01-27. Bibliographically approved.

Open Access in DiVA

fulltext (13059 kB), 351 downloads
File information
File name: FULLTEXT01.pdf
File size: 13059 kB
Checksum (SHA-512): 6c214de54411fbce8f497777e165fb37fe522d1ff700c3de8bfb99421da1f00b36f39486f809b9359281800d1063cb868725d7f06a3f33f9ce4fcfc04d797617
Type: fulltext
Mimetype: application/pdf
