Scraping bot detection using machine learning
KTH, School of Electrical Engineering and Computer Science (EECS).
KTH, School of Electrical Engineering and Computer Science (EECS).
2022 (English)
Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
Botdetektering med hjälp av maskininlärning (Swedish)
Abstract [en]

The illegitimate acquisition and use of data is a problem faced by many organizations that operate web servers on the internet today. Despite frameworks of rules intended to stop “scraping bots” from doing this, such bots have developed advanced methods to keep taking data. Following research into what the problem is and how it can be handled, this report identifies and evaluates how machine learning can be used to detect bots. Because developing and testing a machine learning solution proved difficult, an alternative solution was also developed that aims to polarize (separate) bot and human traffic through behavioral analysis. This solution for optimizing traffic session classification is presented and discussed, along with other key findings that can help in detecting and preventing these unwanted visitors.

Abstract [sv]

Illegal collection and use of data is a problem for many organizations that operate web servers on the internet today. Despite frameworks of rules intended to stop “scraping bots”, these bots have developed advanced ways of getting at data. Following research into what the problem is and how it can be handled, this report identifies and evaluates how machine learning can be used to detect bots. Because developing and testing a machine learning solution proved difficult, an alternative solution was developed with the goal of polarizing (separating) bot traffic and legitimate traffic. This solution is presented and discussed in the report, together with other key results that can help in detecting and preventing these unwanted visitors.

Place, publisher, year, edition, pages
2022. p. 68
Series
TRITA-EECS-EX ; 2022:355
Keywords [en]
Artificial agents, Bot detection, Machine learning, Data analysis, HTTP requests, ReCaptcha
Keywords [sv]
Artificiella agenter, Detektering av bottar, Maskininlärning, Dataanalys, HTTP förfrågningar, ReCaptcha
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-320391
OAI: oai:DiVA.org:kth-320391
DiVA, id: diva2:1705061
External cooperation
The Mobile Life AB
Supervisors
Examiners
Available from: 2022-10-21. Created: 2022-10-20. Last updated: 2022-10-21. Bibliographically approved.

Open Access in DiVA

fulltext (1277 kB), 2305 downloads
File information
File name: FULLTEXT01.pdf
File size: 1277 kB
Checksum (SHA-512):
877e42d651b9b779edcf4cbbb8517713dbd98cec6382b6b12357d7b5f2061fefd123bae1e26b4f1c1962851bc4d92045c2f71329d4ba22a65fbf45376bdafdbc
Type: fulltext
Mimetype: application/pdf
