On practical machine learning and data analysis
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
2008 (English) Doctoral thesis, monograph (Other scientific)
Abstract [en]

This thesis discusses and addresses some of the difficulties associated with practical machine learning and data analysis. Introducing data-driven methods in e.g. industrial and business applications can lead to large gains in productivity and efficiency, but the cost and complexity are often overwhelming. Creating machine learning applications in practice often involves a large amount of manual labour, which often needs to be performed by an experienced analyst without significant experience with the application area. We will here discuss some of the hurdles faced in a typical analysis project and suggest measures and methods to simplify the process.

One of the most important issues when applying machine learning methods to complex data, e.g. in industrial applications, is that the processes generating the data are modelled in an appropriate way. Relevant aspects have to be formalised and represented in a way that allows us to perform our calculations in an efficient manner. We present a statistical modelling framework, Hierarchical Graph Mixtures, based on a combination of graphical models and mixture models. It allows us to create consistent, expressive statistical models that simplify the modelling of complex systems. Using a Bayesian approach, we allow for encoding of prior knowledge and make the models applicable in situations when relatively little data are available.

Detecting structures in data, such as clusters and dependency structure, is very important both for understanding an application area and for specifying the structure of e.g. a hierarchical graph mixture. We will discuss how this structure can be extracted for sequential data. By using the inherent dependency structure of sequential data we construct an information-theoretical measure of correlation that does not suffer from the problems most common correlation measures have with this type of data.

In many diagnosis situations it is desirable to perform a classification in an iterative and interactive manner. The matter is often complicated by very limited amounts of knowledge and examples when a new system to be diagnosed is initially brought into use. We describe how to create an incremental classification system based on a statistical model that is trained from empirical data, and show how the limited available background information can still be used initially for a functioning diagnosis system.

To minimise the effort with which results are achieved within data analysis projects, we need to address not only the models used, but also the methodology and applications that can help simplify the process. We present a methodology for data preparation and a software library intended for rapid analysis, prototyping, and deployment.

Finally, we will study a few example applications, presenting tasks within classification, prediction and anomaly detection. The examples include demand prediction for supply chain management, approximating complex simulators for increased speed in parameter optimisation, and fraud detection and classification within a media-on-demand system.

Place, publisher, year, edition, pages
Stockholm: KTH, 2008, ix, 217 p.
Trita-CSC-A, ISSN 1653-5723; 2008-11
SICS Dissertation Series, ISSN 1101-1335; 49
National Category
Computer Science
URN: urn:nbn:se:kth:diva-4788
ISBN: 978-91-7178-993-3
OAI: diva2:13955
Public defence
2008-06-11, Sal FD5, AlbaNova, Roslagstullsbacken 21, Stockholm, 13:00
QC 20100727
Available from: 2008-06-02 Created: 2008-06-02 Last updated: 2010-07-27 Bibliographically approved

Author
Gillblad, Daniel