Machine learning holds promise for advancing threat detection in cybersecurity, yet the sheer volume of unlabeled data makes manual labeling a major bottleneck. This chapter examines how effectively active learning can reduce that labeling burden. Active learning uses query strategies to identify the most informative unlabeled data points for labeling. The study compares several query strategies by testing a transformer model's ability to identify tweets that reference advanced persistent threats. When labeled training data is scarce, the results suggest that the K-means diversity-based query strategy outperforms both the uncertainty-based strategy and random selection of data points. The study also investigates the cost-effective active learning paradigm, which integrates high-confidence data points into the training dataset. Surprisingly, this approach proved to be the least effective strategy. Overall, the findings illustrate the potential of active learning in cybersecurity and underscore the importance of strategic data selection for optimizing model performance.
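The query strategies named above can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the use of entropy as the uncertainty measure, the nearest-to-centroid selection rule, and the entropy threshold for high-confidence points are all assumptions made for illustration. The inputs are assumed to be per-example class probabilities and embedding vectors produced by some classifier, such as a transformer.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def random_query(n_pool, k, seed=0):
    """Baseline: pick k pool indices uniformly at random."""
    return random.Random(seed).sample(range(n_pool), k)


def uncertainty_query(probs, k):
    """Pick the k pool points whose predictions have the highest entropy."""
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    return order[:k]


def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_diversity_query(embeddings, k, iters=20, seed=0):
    """Cluster the pool into k groups, then pick the point nearest each centroid."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    centroids = [list(embeddings[i]) for i in rng.sample(range(len(embeddings)), k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, e in enumerate(embeddings):
            nearest = min(range(k), key=lambda c: _sq_dist(e, centroids[c]))
            clusters[nearest].append(i)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [
                    sum(embeddings[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    picked = []
    for c in range(k):
        candidates = [i for i in range(len(embeddings)) if i not in picked]
        picked.append(min(candidates, key=lambda i: _sq_dist(embeddings[i], centroids[c])))
    return picked


def high_confidence_points(probs, threshold=0.05):
    """Cost-effective variant (assumed form): return (index, predicted class)
    for pool points whose entropy falls below a hypothetical threshold."""
    selected = []
    for i, p in enumerate(probs):
        if entropy(p) < threshold:
            selected.append((i, max(range(len(p)), key=lambda c: p[c])))
    return selected


# Toy pool: four predictions and six 2-D embeddings in two separated groups.
pool_probs = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3], [0.999, 0.001]]
pool_embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

uncertain = uncertainty_query(pool_probs, 2)
diverse = kmeans_diversity_query(pool_embeddings, 2)
confident = high_confidence_points(pool_probs)
```

On this toy pool, the uncertainty query favors the predictions closest to a coin flip, while the diversity query draws one representative from each group of embeddings, which is one intuition for why a diversity-based strategy might help when few labels are available.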
Part of ISBN 9781032944623, 9781040323977
QC 20251216