Efficient Distributed Pipelines for Anomaly Detection on Massive Production Logs
KTH, School of Information and Communication Technology (ICT).
2014 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The volume of live corporate production logs grows every day. On one hand, companies must handle the millions of records their services produce daily, which requires high storage capacity. On the other hand, relevant information can be extracted from this massive amount of data and analyzed for different purposes, such as generating behavior patterns, detecting anomalies, and making predictions. All of these can be achieved with machine learning and data mining techniques, with distributed platforms providing the computation and memory capacity needed for data-intensive processing. Services such as payment monitoring are very sensitive and require fast anomaly detection over streams of transactions. However, traditional anomaly detection on distributed batch processing platforms such as Hadoop is very expensive to run, and anomalies cannot be detected in real time.

To overcome this drawback, Distributed Stream Processing (DSP) platforms such as Storm have proven to be more flexible and powerful tools for dealing with such streams. Furthermore, since anomaly patterns in data streams are not predefined and may change over time, unsupervised learning algorithms such as clustering should be applied first to output significant anomalies, which contribute to forming and updating anomaly patterns. Real-time anomaly detection on new data streams can then be based on these patterns. This thesis project aims to provide a distributed system on top of Storm that combines batch-based unsupervised learning and streaming rule-based methods to detect anomalies in Spotify payment transactions in real time.

The anomaly detection system implements the k-means and DBSCAN clustering algorithms as an unsupervised learning module to find anomalous behavior in payment transaction streams. Based on those anomalies, the frequent itemset algorithm estDec is implemented to extract anomaly patterns. Stratified Complex Event Processing (CEP) engines based on Esper are reconfigured with these patterns to perform rule-based anomaly detection in real time over absolute time sliding windows. Experimental results indicate that such a complex system over a unified data flow pipeline can feasibly detect anomalies in real time through rule-based anomaly detection with a CEP engine. Unsupervised learning methods can provide lightweight batch-based (nearly real-time) anomaly detection, but different factors heavily influence their performance. The rule-based method performs better in the heavy anomaly density scenario, showing better sensitivity and lower detection latency.
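
The two detection styles named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation and uses neither Storm nor Esper. It assumes that cluster centroids are already available from a batch clustering step such as k-means, and that a rule has the simple form "at least N matching transactions within the last W seconds". The `Transaction` fields, the function and class names, and all thresholds are hypothetical, chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Transaction:
    # All fields are hypothetical; the abstract does not describe the schema.
    timestamp: float  # event time in seconds
    country: str
    amount: float


def far_from_centroids(points, centroids, max_distance):
    """Batch step: return the points whose nearest centroid is farther
    away than max_distance. Centroids are assumed to come from a prior
    clustering run (for example k-means)."""
    return [
        p for p in points
        if min(dist(p, c) for c in centroids) > max_distance
    ]


class SlidingWindowRule:
    """Streaming step: fire when at least `threshold` matching events
    fall inside the last `window_seconds` seconds of event time."""

    def __init__(self, predicate, window_seconds, threshold):
        self.predicate = predicate
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits = deque()  # timestamps of matching events, oldest first

    def observe(self, tx):
        # Evict matches that have slid out of the time window.
        cutoff = tx.timestamp - self.window_seconds
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        if self.predicate(tx):
            self._hits.append(tx.timestamp)
        return len(self._hits) >= self.threshold
```

In a pipeline of the kind the abstract describes, the batch step would periodically produce anomalies, a pattern-mining step would turn them into predicates, and objects like `SlidingWindowRule` would be rebuilt from those predicates and fed each incoming transaction through `observe`.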

Place, publisher, year, edition, pages
2014. 86 p.
Series
TRITA-ICT-EX, 2014:134
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-177365
OAI: oai:DiVA.org:kth-177365
DiVA: diva2:872469
Examiners
Available from: 2015-11-19
Created: 2015-11-19
Last updated: 2017-08-03
Bibliographically approved

Open Access in DiVA

fulltext (1103 kB), 28 downloads
File information
File name: FULLTEXT01.pdf
File size: 1103 kB
Checksum (SHA-512):
919dd6e41c458ff5282182c8962c70b70acdda7746a781ff9b8be164dc0aeb67c5024e255e58d4332f937168a3422830b97f9df17bdcd257f3ca970f572876a1
Type: fulltext
Mimetype: application/pdf
