Active learning for text classification in cyber security
KTH, School of Electrical Engineering and Computer Science (EECS).
2023 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Aktiv inlärning för textklassificering i cyberdomänen (Swedish)
Abstract [en]

In the domain of cyber security, machine learning promises advanced threat detection. However, the volume of available unlabelled data poses challenges for efficient data management. This study investigates the potential of active learning, a subset of interactive machine learning, to reduce the effort required for manual data labelling. Different query strategies were used to select the most informative unlabelled data points for manual labelling. The strategies were compared by testing how accurately a transformer model trained on the resulting data could identify tweets mentioning names of advanced persistent threats. The findings suggest that the K-means diversity-based query strategy outperformed both the uncertainty-based approach and random data point selection when the amount of labelled training data was limited. This study also evaluated the cost-effective active learning approach, which incorporates high-confidence data points into the training dataset. However, this was shown to be the least effective strategy. Lastly, the study notes that computation time varies significantly between query strategies. Hence, selecting an optimal query strategy requires weighing F-score performance against time efficiency.
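
The query strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the thesis implementation, so everything below is an assumption made for illustration: the function names, the least-confidence criterion for uncertainty sampling, the choice of the pool point nearest each K-means centroid, and the 0.95 confidence threshold for the cost-effective step are all hypothetical. Here `probs` stands for a model's predicted class probabilities and `embeddings` for its vector representations of the unlabelled pool.

```python
import math
import random


def random_query(pool_size, n, seed=0):
    """Baseline: pick n distinct pool indices uniformly at random."""
    return random.Random(seed).sample(range(pool_size), n)


def uncertainty_query(probs, n):
    """Least-confidence sampling (assumed criterion): pick the n points
    whose highest class probability is lowest."""
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return order[:n]


def kmeans_query(embeddings, n, iters=10, seed=0):
    """Diversity sampling (assumed variant): cluster the pool into n groups
    with plain Lloyd iterations, then pick the point nearest each centroid."""
    rng = random.Random(seed)
    centroids = [list(embeddings[i]) for i in rng.sample(range(len(embeddings)), n)]
    for _ in range(iters):
        clusters = [[] for _ in range(n)]
        for x in embeddings:
            nearest = min(range(n), key=lambda c: math.dist(x, centroids[c]))
            clusters[nearest].append(x)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    picked = []
    for c in range(n):
        candidates = [i for i in range(len(embeddings)) if i not in picked]
        picked.append(min(candidates, key=lambda i: math.dist(embeddings[i], centroids[c])))
    return picked


def ceal_pseudo_labels(probs, threshold=0.95):
    """Cost-effective step (assumed threshold rule): return (index, label)
    pairs for points the model is already confident about, so they can be
    added to the training set without manual labelling."""
    return [(i, p.index(max(p))) for i, p in enumerate(probs) if max(p) >= threshold]


# Toy pool: five points with two-class probabilities from a hypothetical model.
probs = [[0.99, 0.01], [0.55, 0.45], [0.10, 0.90], [0.51, 0.49], [0.97, 0.03]]
print(uncertainty_query(probs, 2))      # [3, 1]
print(ceal_pseudo_labels(probs, 0.95))  # [(0, 0), (4, 0)]

# Toy embeddings forming two well-separated groups.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
print(sorted(kmeans_query(embeddings, 2)))
```

In this sketch the uncertainty strategy needs only one pass over the predicted probabilities, while the K-means strategy iterates over the whole pool several times. That difference in cost is one way the trade-off between F-score and computation time mentioned in the abstract can arise.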

Abstract [sv] (English translation)

Machine learning could be used for advanced threat detection in the cyber domain. However, the need for training data, together with the large supply of unannotated data, poses a challenge. This work investigates whether active learning, a subset of interactive machine learning, can reduce the need for annotated data. Using different query strategies, the most informative data points were selected for human annotation. The results of the different query strategies were then evaluated by testing a machine learning model's ability to correctly identify tweets that contain names of cyber threat actors. The results indicate that when the amount of annotated data was limited, the diversity-based K-means strategy performed better than both the uncertainty-based query strategy and the strategy that selects data points at random. This study also evaluated cost-effective active learning, which adds data points that the model is already relatively confident about to the training dataset. However, this method proved to be the least effective strategy. Finally, the work shows that the computation time required for each query strategy varies considerably. Choosing the best query strategy therefore requires weighing both performance and time cost.

Place, publisher, year, edition, pages
2023. p. 46
Series
TRITA-EECS-EX ; 2023:368
Keywords [en]
Interactive machine learning, Active learning, Cost-effective active learning, Cyber environment
Keywords [sv]
Interaktiv maskininlärning, Aktiv inlärning, Kostnadseffektiv aktiv inlärning, Cyberdomänen
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-336621
OAI: oai:DiVA.org:kth-336621
DiVA, id: diva2:1797634
External cooperation
Swedish Defence Research Agency
Supervisors
Examiners
Available from: 2023-09-18
Created: 2023-09-15
Last updated: 2023-09-18
Bibliographically approved

Open Access in DiVA

fulltext (6194 kB), 458 downloads
File information
File name: FULLTEXT01.pdf
File size: 6194 kB
Checksum (SHA-512):
606e7a271b1f09c9ea1cbc500411e630fd64b02edd7d6d147d0f190e44ca0afc33ced01f9c41a48990a0c48706d1f0ed36b049cd08af8b7124b23eae5c192445
Type: fulltext
Mimetype: application/pdf

Total: 458 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 1084 hits