Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
The volume of live corporate production logs grows every day. On one hand, companies have to handle millions of data records produced daily by their services, which requires high storage capacity. On the other hand, relevant information can be extracted from this massive amount of data and used for analysis according to different requirements, such as generating behavior patterns, detecting anomalies and making predictions. All of these can be achieved with machine learning and data mining techniques, where distributed platforms provide the computation and memory capacity for data-intensive processing. Services such as payment monitoring in a company are very sensitive and require fast anomaly detection over streams of transactions. However, traditional anomaly detection using distributed batch processing platforms such as Hadoop is very expensive to run, and the anomalies cannot be detected in real time.
Distributed Stream Processing (DSP) platforms such as Storm overcome this drawback and have proven to be more flexible and powerful tools for dealing with such streams. Furthermore, since the anomaly patterns in data streams are not predefined and may change over time, unsupervised learning algorithms such as clustering should be used first to output significant anomalies, which contribute to forming and updating anomaly patterns. Real-time anomaly detection on new data streams can then be built on such patterns. This thesis project aims to provide a distributed system on top of Storm that combines batch-based unsupervised learning with streaming rule-based methods to detect anomalies in Spotify payment transactions in real time.
The anomaly detection system implements the k-means and DBSCAN clustering algorithms as an unsupervised learning module to find anomalous behavior in payment transaction streams. Based on those anomalies, the frequent item set algorithm estDec is implemented to extract anomaly patterns. Stratified Complex Event Processing (CEP) engines based on Esper are reconfigured with these patterns to perform rule-based anomaly detection in real time over absolute time sliding windows. Experimental results indicate that such a complex system, built over a unified data flow pipeline, can feasibly detect anomalies in real time through rule-based detection with the CEP engine. Unsupervised learning methods can provide lightweight batch-based (near real-time) anomaly detection, but various factors heavily influence their performance. The rule-based method performs better in the heavy anomaly density scenario in terms of sensitivity and detection latency.
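As a rough illustration of rule-based detection over a time sliding window, the following is a minimal Python sketch. It is not the Esper-based implementation described above: the class name `SlidingWindowRule`, the event fields, and the threshold semantics are all assumptions made for illustration. A rule holds an anomaly pattern (a set of field values) and reports an anomaly when more than `threshold` matching events fall within the last `window_s` seconds.

```python
from collections import deque


class SlidingWindowRule:
    """Hypothetical rule: fires when more than `threshold` events matching
    `pattern` were observed within the last `window_s` seconds."""

    def __init__(self, pattern, window_s, threshold):
        self.pattern = pattern      # assumed form: dict of field -> required value
        self.window_s = window_s    # window length in seconds
        self.threshold = threshold  # maximum tolerated matches per window
        self.times = deque()        # timestamps of matching events, oldest first

    def observe(self, ts, event):
        """Feed one event with timestamp `ts` (seconds, non-decreasing).
        Returns True if the window currently exceeds the threshold."""
        # Evict matches that have slid out of the window.
        while self.times and self.times[0] <= ts - self.window_s:
            self.times.popleft()
        # Record the event if it matches every field of the pattern.
        if all(event.get(k) == v for k, v in self.pattern.items()):
            self.times.append(ts)
        return len(self.times) > self.threshold


# Usage with made-up transaction fields.
rule = SlidingWindowRule({"status": "failed", "method": "card"},
                         window_s=60, threshold=2)
stream = [
    (0, {"status": "failed", "method": "card"}),
    (10, {"status": "failed", "method": "card"}),
    (20, {"status": "failed", "method": "card"}),
    (100, {"status": "failed", "method": "card"}),
]
alerts = [ts for ts, ev in stream if rule.observe(ts, ev)]
```

In this sketch the pattern would come from the pattern-extraction step, and reconfiguring the detector amounts to replacing the rule object; a real CEP engine expresses the same idea declaratively as a windowed query.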
2014. 86 p.