Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Encoding Sequential Structures using Kernels
Sequential data types are a natural model for information in many fields, such as Time-Series Analysis and Computational Biology. Owing to their highly dynamic nature, sequential data remain a challenge for modern learning methods, which struggle to integrate them fully into their mechanisms. Kernel Methods offer a practical and accessible framework for integrating structured data into well-established Machine Learning algorithms; the only requirement is the definition of an adequate kernel function that reflects the task at hand.
Mainstream kernel functions for strings, a particular kind of sequence, are mostly based on explicitly constructed feature vectors that describe the input through its relationship to a set of references, usually n-grams. More recently, efforts have been made to develop more sophisticated and dedicated solutions: the current state of the art uses paths to access alignment-based matchings between sequences.
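The n-gram-based family mentioned above can be illustrated with a minimal sketch of a spectrum-style string kernel: each string is described by the counts of its n-grams, and the kernel is the inner product of those count vectors. The function names below are illustrative, not taken from the thesis.

```python
from collections import Counter

def ngram_profile(s, n=3):
    """Explicit feature vector of a string: counts of its contiguous n-grams."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(s, t, n=3):
    """Inner product of the two n-gram count vectors (a spectrum-style kernel)."""
    ps, pt = ngram_profile(s, n), ngram_profile(t, n)
    # Only n-grams present in both strings contribute to the sum.
    return sum(count * pt[g] for g, count in ps.items())

print(spectrum_kernel("abracadabra", "abracadabra"))  # high self-similarity: 13
print(spectrum_kernel("abracadabra", "xyzxyzxyzxy"))  # no shared trigrams: 0
```

Such kernels are cheap to compute but, unlike the alignment-based methods the text refers to, they ignore where in the sequence a match occurs.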
This Master project aims to broaden the limited variety of existing kernels for sequences. Inspired by modern methods, and mindful of the issues they suffer from, the project has resulted in the development of three novel kernel functions.
A set of experiments is designed to evaluate the proposed methods qualitatively. For this purpose, univariate artificial data sequences are generated with various levels of noise to simulate real-world scenarios. The experiments include classification tasks, evaluated through appropriate cross-validation, and regression tasks, along with visual representations of the feature space induced by the kernels. The results show substantial improvements in classification, generalisation, and robustness to various levels of noise.