kth.se Publications
Segmentation of companies using DBSCAN and K-Means
KTH, School of Electrical Engineering and Computer Science (EECS).
KTH, School of Electrical Engineering and Computer Science (EECS).
2022 (English)
Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title: Segmentering av företag med DBSCAN och K-Means (Swedish)
Abstract [en]

Data management and machine learning have become important tools for organizations around the world, for example to provide a basis for further processing. This work aims to help a company map corporate industries using keywords from companies' websites, by means of machine learning. The thesis explains in detail how the model was created, describing the algorithms, theories, and methods used and the model's performance. The work examines the clustering methods K-means and DBSCAN in combination with the vectorization methods TF-IDF and Bag of Words. Evaluation is done using the Silhouette Coefficient (SC) and an individual assessment. DBSCAN proves to be the better clustering method on this data set. However, there are problems with the data set, namely how distinct the differences between the companies' keywords are. Because of this problem, the clustering methods produce uncertainties too large for the tool to be used for commercial purposes. The tool could be used in future implementations, but the data set would need to have more distinct differences.
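The evaluation idea named in the abstract can be illustrated with a minimal sketch, under stated assumptions. The keyword lists and cluster labels below are hypothetical and are not data or results from the thesis; the labels stand in for the output of K-means or DBSCAN, which are not implemented here. The idf weighting ln(N/df) and the cosine distance are also assumptions, since the abstract does not say which variants were used. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical keyword lists, one per company (not data from the thesis).
docs = [
    ["bank", "loan", "finance"],
    ["bank", "finance", "insurance"],
    ["software", "cloud", "code"],
    ["software", "code", "hosting"],
]

def bag_of_words(docs):
    """Return (vocabulary, count vectors), one count vector per document."""
    vocab = sorted({w for d in docs for w in d})
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def tf_idf(count_vectors):
    """Weight raw counts by idf = ln(N / df), one common textbook variant."""
    n = len(count_vectors)
    dims = len(count_vectors[0])
    df = [sum(1 for v in count_vectors if v[j] > 0) for j in range(dims)]
    return [[v[j] * math.log(n / df[j]) for j in range(dims)]
            for v in count_vectors]

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def silhouette(vectors, labels, dist=cosine_distance):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, li in enumerate(labels):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, lj in enumerate(labels):
            if j != i:
                by_cluster.setdefault(lj, []).append(dist(vectors[i], vectors[j]))
        if li not in by_cluster:
            scores.append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        a = sum(by_cluster[li]) / len(by_cluster[li])   # mean intra-cluster distance
        b = min(sum(ds) / len(ds)                       # nearest other cluster
                for l, ds in by_cluster.items() if l != li)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

vocab, bow = bag_of_words(docs)
weighted = tf_idf(bow)
# Hand-assigned labels standing in for a clustering result (assumption).
labels = [0, 0, 1, 1]
print("Bag of Words SC:", silhouette(bow, labels))
print("TF-IDF SC:      ", silhouette(weighted, labels))
```

A score near 1 indicates compact, well-separated clusters, while a score near or below 0 indicates overlapping clusters. Keywords that differ little between companies would push the score toward 0, which is the kind of problem the abstract describes.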

Abstract [sv] (translated from Swedish)

Data management and machine learning have become an important tool for organizations around the world, for example to provide a basis for further refinement. This work aims to help a company map corporate industries using keywords from the companies' websites. We do this with machine learning. The thesis explains in detail how this model was created by describing the algorithms, theories, and methods used and the model's performance. The work examines the clustering methods K-means and DBSCAN with the vectorization methods TF-IDF and Bag of Words. Evaluation is done using the Silhouette Coefficient (SC) method and an individual assessment. DBSCAN proves to be a better clustering method on this data set. However, there are problems in the data set, namely how distinct the differences between the companies' keywords are. This problem means that the clustering methods create uncertainties too large for use for commercial purposes. It is possible to use this tool for future implementations, but the data set needs to have more distinct differences.

Place, publisher, year, edition, pages
2022. p. 13
Series
TRITA-EECS-EX ; 2022:352
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-319123
OAI: oai:DiVA.org:kth-319123
DiVA, id: diva2:1699091
Available from: 2022-09-28
Created: 2022-09-26
Last updated: 2022-09-28
Bibliographically approved

Open Access in DiVA

fulltext (1022 kB), 141 downloads
File information
File name: FULLTEXT01.pdf
File size: 1022 kB
Checksum (SHA-512): 233ab1c206fd6ba8320a048716ccb0b4f61c2d70a872f7fda5ce6873112b8f38af868f823574fde8166bff1daf660cc893dde581b4d5ad6db0bddb78a01af949
Type: fulltext
Mimetype: application/pdf

Authors
Dahlkvist, Jacob; Tomczak, William