In cybersecurity, machine learning can offer advanced threat detection. However, the large volume of unlabeled data poses challenges for efficient data management. This study investigates whether active learning can reduce the effort required for manual data labeling. Several query strategies were used to select the most informative unlabeled data points for labeling. Each strategy was assessed by testing how accurately a transformer model could distinguish tweets mentioning names of advanced persistent threats. The findings suggest that the K-means diversity-based query strategy outperformed both the uncertainty-based approach and random data point selection when the amount of labeled training data was limited. The study also evaluated the cost-effective active learning approach, which incorporates high-confidence data points into the training dataset. However, this proved to be the least effective strategy.
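The query strategies named above can be illustrated with a minimal sketch. This is not the study's implementation. It assumes that a classifier has already produced class probabilities (`probs`) and vector embeddings (`embeddings`) for the unlabeled pool. All function names, the least-confidence form of uncertainty sampling, the nearest-to-centroid selection rule, and the 0.95 confidence threshold are illustrative assumptions.

```python
import math
import random


def random_query(n_pool, k, seed=0):
    """Baseline: pick k pool indices uniformly at random."""
    return random.Random(seed).sample(range(n_pool), k)


def uncertainty_query(probs, k):
    """Least-confidence sampling (assumed variant): pick the k points
    whose highest class probability is lowest."""
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return order[:k]


def kmeans_query(embeddings, k, n_iter=20, seed=0):
    """Diversity sampling: cluster the pool into k groups with plain
    K-means, then return the pool index nearest to each centroid."""
    rng = random.Random(seed)
    centroids = [list(embeddings[i]) for i in rng.sample(range(len(embeddings)), k)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in embeddings:
            nearest = min(range(k), key=lambda c: math.dist(x, centroids[c]))
            clusters[nearest].append(x)
        # Update step: move each non-empty cluster's centroid to its mean.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    picked = []
    for c in range(k):
        # Skip indices already chosen so that k distinct points are returned.
        candidates = [i for i in range(len(embeddings)) if i not in picked]
        picked.append(min(candidates, key=lambda i: math.dist(embeddings[i], centroids[c])))
    return picked


def ceal_pseudo_labels(probs, threshold=0.95):
    """CEAL-style step: return (index, predicted class) for points the
    model is highly confident about. The threshold is an assumed value."""
    return [(i, p.index(max(p))) for i, p in enumerate(probs) if max(p) >= threshold]


if __name__ == "__main__":
    probs = [[0.99, 0.01], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5]]
    embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print("uncertainty:", uncertainty_query(probs, 2))
    print("k-means diversity:", kmeans_query(embeddings, 2))
    print("CEAL pseudo-labels:", ceal_pseudo_labels(probs))
```

In an active learning loop, the indices returned by a query function would be sent to a human annotator, the model retrained on the enlarged labeled set, and the process repeated. The CEAL-style step differs in that it adds the model's own predictions as labels without human review, so a wrong but confident prediction enters the training data unchecked.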
Part of ISBN 9798350345346