kth.se Publications
1 - 66 of 66
  • 1.
    Alessandro, Sanvito
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Humans in the wild: NeRFs for Dynamic Scenes Modeling from In-the-Wild Monocular Videos with Humans (2023). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Recent advancements in computer vision have led to the emergence of Neural Radiance Fields (NeRFs), a powerful tool for reconstructing photorealistic 3D scenes, even in dynamic settings. However, these methods struggle when dealing with human subjects, especially when the subject is partially obscured or not completely observable, resulting in inaccurate reconstructions of geometries and textures. To address this issue, this thesis evaluates state-of-the-art human modeling using implicit representations with partial observability of the subject. We then propose and test several novel methods to improve the generalization of these models, including the use of symmetry and Signed Distance Function (SDF) driven losses and leveraging prior knowledge from multiple subjects via a pre-trained model. Our results demonstrate that our proposed methods significantly improve the accuracy of the reconstructions, both quantitatively and qualitatively, even in challenging "in-the-wild" situations. Our approach opens new opportunities for applications such as asset generation for video games and movies and improved simulations for autonomous driving scenarios from abundant in-the-wild monocular videos. In summary, our research presents a significant improvement to state-of-the-art human modeling using implicit representations, with important implications for 3D Computer Vision (CV) and Neural Rendering and their applications in various industries.
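The symmetry prior this abstract mentions can be sketched in a few lines. This is an illustrative toy, not the thesis's implementation: `sdf` stands in for the learned network, and mirroring across the x = 0 plane assumes a canonical, axis-aligned body pose.

```python
def symmetry_loss(sdf, points):
    """Penalize asymmetry: in a canonical pose a human body is roughly
    bilaterally symmetric, so the signed distance at a point and at its
    mirror image across the x = 0 plane should agree."""
    total = 0.0
    for x, y, z in points:
        total += (sdf((x, y, z)) - sdf((-x, y, z))) ** 2
    return total / len(points)

# A unit sphere is perfectly symmetric, so the loss vanishes:
sphere = lambda p: (p[0] ** 2 + p[1] ** 2 + p[2] ** 2) ** 0.5 - 1.0
```

In training, a term like this would be added to the photometric losses with a small weight, so the prior regularizes rather than dominates.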

  • 2.
    Alkathiri, Abdul Aziz
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Decentralized Large-Scale Natural Language Processing Using Gossip Learning (2020). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    The field of Natural Language Processing in machine learning has seen rising popularity and use in recent years. The nature of Natural Language Processing, which deals with natural human language and computers, has led to the research and development of many algorithms that produce word embeddings. One of the most widely used of these algorithms is Word2Vec. With the abundance of data generated by users and organizations and the complexity of machine learning and deep learning models, performing training using a single machine becomes unfeasible. The advancement in distributed machine learning offers a solution to this problem. Unfortunately, for reasons concerning data privacy and regulations, in some real-life scenarios the data must not leave its local machine. This limitation has led to the development of techniques and protocols that are massively parallel and data-private. The most popular of these protocols is federated learning. However, due to its centralized nature, it still poses some security and robustness risks. Consequently, this led to the development of massively parallel, data-private, decentralized approaches, such as gossip learning. In the gossip learning protocol, each node in the network periodically chooses a random peer for information exchange, which eliminates the need for a central node. This research intends to test the viability of gossip learning for large-scale, real-world applications. In particular, it focuses on the implementation and evaluation of a Natural Language Processing application using gossip learning. The results show that the application of Word2Vec in a gossip learning framework is viable and yields results comparable to its non-distributed, centralized counterpart for various scenarios, with an average loss in quality of 6.904%.
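The peer-exchange loop described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's system: scalar "models" and plain averaging stand in for full Word2Vec parameter matrices and the (often age-weighted) merge rules used in real gossip learning.

```python
import random

random.seed(7)  # for reproducibility of the toy run

def gossip_round(models, merge):
    """One gossip round: every node picks a random peer and both
    replace their models with the merged result. No central node
    is ever consulted."""
    nodes = list(models)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = merge(models[node], models[peer])
        models[node] = merged
        models[peer] = merged

def average(a, b):
    # Element-wise average of two parameter vectors.
    return [(x + y) / 2 for x, y in zip(a, b)]

# Toy network: three nodes with scalar parameters drift toward consensus.
models = {"n1": [0.0], "n2": [6.0], "n3": [12.0]}
for _ in range(20):
    gossip_round(models, average)
```

Note that pairwise averaging conserves the total sum of the parameters, so the nodes converge toward the global mean without any coordinator.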

  • 3.
    Aly, Mazen
    KTH, School of Information and Communication Technology (ICT).
    Automated Bid Adjustments in Search Engine Advertising (2017). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    In digital advertising, major search engines allow advertisers to set bid adjustments on their ad campaigns in order to capture the valuation differences that are a function of query dimensions. In this thesis, a model that uses bid adjustments is developed in order to increase the number of conversions and decrease the cost per conversion. A statistical model is used to select campaigns and dimensions that need bid adjustments along with several techniques to determine their values since they can be between -90% and 900%. In addition, an evaluation procedure is developed that uses campaign historical data in order to evaluate the calculation methods as well as to validate different approaches. We study the problem of interactions between different adjustments and a solution is formulated. Real-time experiments showed that our bid adjustments model improved the performance of online advertising campaigns with statistical significance. It increased the number of conversions by 9%, and decreased the cost per conversion by 10%.
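The arithmetic behind the adjustment range mentioned in this abstract can be illustrated briefly. This is a sketch under the common convention that adjustments for different query dimensions are applied multiplicatively; the function names are illustrative, not from the thesis.

```python
def clamp_adjustment(adj):
    """Search engines bound each bid adjustment; the thesis works with
    the range -90% to +900%, expressed here as fractions."""
    return max(-0.90, min(9.00, adj))

def effective_bid(base_bid, adjustments):
    """Adjustments for different query dimensions (e.g. device, location)
    compound multiplicatively: bid * (1 + a1) * (1 + a2) * ..."""
    bid = base_bid
    for adj in adjustments:
        bid *= 1.0 + clamp_adjustment(adj)
    return bid

# A +20% device adjustment combined with a -50% location adjustment:
bid = effective_bid(1.00, [0.20, -0.50])  # 1.00 * 1.2 * 0.5 = 0.60
```

The multiplicative compounding is exactly why interactions between adjustments, which the thesis studies, matter: two individually reasonable adjustments can combine into an extreme effective bid.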

  • 4.
    Amaya de la Pena, Ignacio
    KTH, School of Information and Communication Technology (ICT).
    Fraud detection in online payments using Spark ML (2017). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Frauds in online payments cause billions of dollars in losses every year. To reduce them, traditional fraud detection systems can be enhanced with the latest advances in machine learning, which usually require distributed computing frameworks to handle the big size of the available data.

    Previous academic work has failed to address fraud detection in real-world environments. To fill this gap, this thesis focuses on building a fraud detection classifier on Spark ML using real-world payment data.

    Class imbalance and non-stationarity reduced the performance of our models, so experiments to tackle those problems were performed. Our best results were achieved by applying undersampling and oversampling on the training data to reduce the class imbalance. Updating the model regularly to use the latest data also helped diminish the negative effects of non-stationarity.

    A final machine learning model that leverages all our findings has been deployed at Qliro, an important online payments provider in the Nordics. This model periodically sends suspicious purchase orders for review to fraud investigators, enabling them to catch frauds that were missed before.
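The combined under/oversampling strategy described above can be sketched in plain Python. This is an illustration, not Qliro's pipeline: the "meet in the middle" target size and the function names are assumptions for the example.

```python
import random

def rebalance(majority, minority, seed=0):
    """Meet in the middle: undersample the majority class and oversample
    the minority class (with replacement) to a common size, reducing
    the class imbalance seen by the classifier."""
    rng = random.Random(seed)
    k = (len(majority) + len(minority)) // 2
    under = rng.sample(majority, k)
    over = list(minority) + [rng.choice(minority) for _ in range(k - len(minority))]
    return under + over

# 1000 legitimate orders vs. 50 fraudulent ones -> 525 of each class.
balanced = rebalance(list(range(1000)), list(range(1000, 1050)))
```

In a Spark ML setting the same idea is usually expressed with `sampleBy` on the training DataFrame rather than explicit Python lists.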

  • 5.
    Anghileri, Davide
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Using Player Modeling to Improve Automatic Playtesting (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    In this thesis we present two approaches to improve automatic playtesting using player modeling. By modeling various cohorts of players we are able to train Convolutional Neural Network based agents that simulate human gameplay using different strategies directly learnt from real player data. The goal is to use the developed agents to predict useful metrics of newly created game content.

    We validated our approaches using the game Candy Crush Saga, a non-deterministic match-three puzzle game with a huge search space and more than three thousand levels available. To the best of our knowledge this is the first time that player modeling is applied in a match-three puzzle game. Nevertheless, the presented approaches are general and can be extended to other games as well. The proposed methods are compared to a baseline approach that simulates gameplay using a single strategy learnt from random gameplay data. Results show that by simulating different strategies, our approaches can more accurately predict the level difficulty, measured as the players’ success rate, on new levels. Both the approaches improved the mean absolute error by 13% and the mean squared error by approximately 23% when predicting with linear regression models. Furthermore, the proposed approaches can provide useful insights to better understand the players and the game.

  • 6.
    Aparicio Vázquez, Ignacio
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Venn Prediction for Survival Analysis: Experimenting with Survival Data and Venn Predictors (2020). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    The goal of this work is to expand the knowledge on the field of Venn Prediction employed with Survival Data. Standard Venn Predictors have been used with Random Forests and binary classification tasks. However, they have not been utilised to predict events with Survival Data nor in combination with Random Survival Forests. With the help of a Data Transformation, the survival task is transformed into several binary classification tasks. One key aspect of Venn Prediction is the choice of categories. The standard number of categories is two, one for each class to predict. In this work, the usage of ten categories is explored and the performance differences between two and ten categories are investigated. Seven data sets are evaluated, and their results presented with two and ten categories. For the Brier Score and Reliability Score metrics, two categories offered the best results, while Quality performed better employing ten categories. Occasionally, the models are too optimistic. Venn Predictors rectify this behaviour and produce well-calibrated probabilities.
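The core Venn Prediction step can be sketched for a binary task. This is a toy with a deliberately simple, label-independent taxonomy; the thesis uses Random-Forest-based taxonomies and survival-specific transformations, so everything below is illustrative.

```python
def venn_predict(calibration, x, taxonomy):
    """Venn prediction with a simple (label-independent) taxonomy:
    the test object joins the calibration examples in its category;
    for each tentative label y the category's frequency of label 1 is
    recomputed with (x, y) included, giving a probability interval
    rather than a single point estimate."""
    members = [y for cx, y in calibration if taxonomy(cx) == taxonomy(x)]
    freqs = [sum(members + [y]) / (len(members) + 1) for y in (0, 1)]
    return min(freqs), max(freqs)

# Toy taxonomy: objects are scores in [0, 1], category = rounded score.
cal = [(0.1, 0), (0.2, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
interval = venn_predict(cal, 0.85, lambda s: round(s))  # (0.75, 1.0)
```

The width of the interval shrinks as the category accumulates calibration examples, which is the sense in which Venn Predictors are automatically well calibrated.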

  • 7.
    Arcidiacono, Claudio Salvatore
    KTH, School of Electrical Engineering and Computer Science (EECS).
    An empirical study on synthetic image generation techniques for object detectors (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Convolutional Neural Networks are a very powerful machine learning tool that has outperformed other techniques in image recognition tasks. The biggest drawback of this method is the massive amount of training data required, since producing training data for image recognition tasks is very labor intensive. To tackle this issue, different techniques have been proposed to generate synthetic training data automatically. These synthetic data generation techniques can be grouped in two categories: the first category generates synthetic images using computer graphics software and CAD models of the objects to recognize; the second category generates synthetic images by cutting the object from an image and pasting it onto another image. Since both techniques have their pros and cons, it would be interesting for industries to investigate the two approaches in more depth. A common use case in industrial scenarios is detecting and classifying objects inside an image. Different objects appertaining to classes relevant in industrial scenarios are often indistinguishable (for example, they are all the same component). For these reasons, this thesis work aims to answer the research question "Among the CAD model generation technique, the cut-paste generation technique and a combination of the two, which technique is more suitable for generating images for training object detectors in industrial scenarios?". In order to answer the research question, two synthetic image generation techniques appertaining to the two categories are proposed. The proposed techniques are tailored for applications where all the objects appertaining to the same class are indistinguishable, but they can also be extended to other applications. The two synthetic image generation techniques are compared by measuring the performance of an object detector trained using synthetic images on a test dataset of real images. The performance of the two synthetic data generation techniques used for data augmentation has also been measured. The empirical results show that the CAD model generation technique works significantly better than the cut-paste generation technique when synthetic images are the only source of training data (61% better), whereas the two generation techniques perform equally well as data augmentation techniques. Moreover, the empirical results show that the model trained using only synthetic images performs almost as well as the model trained using real images (7.4% worse) and that augmenting the dataset of real images with synthetic images improves the performance of the model (9.5% better).
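The cut-paste idea can be sketched with plain nested lists standing in for images. This is a minimal illustration, not the thesis's generator: real pipelines add blending, scaling, and rotation, and images are arrays rather than lists.

```python
def cut_paste(background, obj, mask, top, left):
    """Composite a cut-out object onto a background image: copy only
    the pixels where the object's binary mask is 1. The paste location
    directly yields the bounding-box label the detector trains on."""
    out = [row[:] for row in background]
    for i, mask_row in enumerate(mask):
        for j, m in enumerate(mask_row):
            if m:
                out[top + i][left + j] = obj[i][j]
    bbox = (top, left, len(mask), len(mask[0]))  # (top, left, height, width)
    return out, bbox

# 4x4 background of zeros, 2x2 object of nines with an L-shaped mask:
image, bbox = cut_paste([[0] * 4 for _ in range(4)],
                        [[9, 9], [9, 9]],
                        [[1, 0], [1, 1]], 1, 1)
```

The appeal of the approach is visible even in this toy: the ground-truth annotation comes for free from the paste coordinates, with no manual labeling.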

  • 8.
    Bajarunas, Kristupas
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Generative Adversarial Networks for Vehicle Trajectory Generation (2022). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Deep learning models rely heavily on an abundance of data, and their performance is directly affected by data availability. In mobility pattern modeling, problems such as next-location prediction or flow prediction are commonly solved using deep learning approaches. Despite advances in modeling techniques, complications arise when the acquisition of mobility data is limited by geographic factors and data protection laws. Generating high-quality synthetic data is one way to work around such scarcity. Trajectory generation is concerned with generating trajectories that can reproduce the spatial and temporal characteristics of the underlying original mobility patterns. The task of this project was to evaluate the capability of Generative Adversarial Networks (GANs) to generate synthetic vehicle trajectory data. We extend the methodology of previous research on trajectory generation by introducing conditional trajectory duration labels and a model pretraining mechanism. The evaluation of generated trajectories consisted of a two-fold analysis. We perform qualitative analysis by visually inspecting generated trajectories and quantitative analysis by calculating the statistical distance between synthetic and original data distributions. The results indicate that extending the previous GAN methodology allows the novel model to generate trajectories statistically closer to the original data distribution. Nevertheless, a statistical base model has the best generative performance and is the only model to generate visually plausible results. We attribute the superior performance of the statistical base model to the highly predictable nature of vehicle trajectories, which must follow the road network and tend to follow minimum-distance routes. This research considered only one type of GAN-based model, and further research should explore other architecture alternatives to fully understand the potential of GAN-based models.
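One common choice for the "statistical distance between synthetic and original data distributions" mentioned above is the Jensen-Shannon divergence between normalized histograms. The abstract does not name the metric it uses, so this is an illustrative sketch of one plausible option.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])
    between two normalized histograms, e.g. of trip lengths or of visits
    per road segment, comparing synthetic against original trajectories."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2
```

Unlike raw KL divergence, JSD is symmetric and stays finite when one histogram has empty bins, which makes it convenient for comparing generated and real distributions.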

  • 9.
    Bereczki, Márk
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Graph Neural Networks for Article Recommendation based on Implicit User Feedback and Content (2021). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Recommender systems are widely used in websites and applications to help users find relevant content based on their interests. Graph neural networks have achieved state-of-the-art results in the field of recommender systems, working on data represented in the form of a graph. However, most graph-based solutions face challenges regarding computational complexity or the ability to generalize to new users. Therefore, we propose a novel graph-based recommender system by modifying Simple Graph Convolution, an approach for efficient graph node classification, and add the capability of generalizing to new users. We build our proposed recommender system for recommending the articles of the Peltarion Knowledge Center. By incorporating two data sources, implicit user feedback based on pageview data as well as the content of articles, we propose a hybrid recommender solution. Throughout our experiments, we compare our proposed solution with a matrix factorization approach as well as a popularity-based and a random baseline, analyse the hyperparameters of our model, and examine the capability of our solution to give recommendations to new users who were not part of the training data set. Our model achieves slightly lower, but similar, Mean Average Precision and Mean Reciprocal Rank scores compared to the matrix factorization approach, and outperforms the popularity-based and random baselines. The main advantages of our model are computational efficiency and its ability to give relevant recommendations to new users without the need to retrain the model, which are key features for real-world use cases.
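Simple Graph Convolution, the method this thesis modifies, is compact enough to sketch directly: propagate node features K times with the normalized adjacency matrix, then fit an ordinary linear model on the result. The tiny pure-Python version below is illustrative; real implementations use sparse matrix libraries.

```python
def normalize(adj):
    """Symmetrically normalized adjacency with self-loops:
    S = D^(-1/2) (A + I) D^(-1/2)."""
    n = len(adj)
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]

def sgc_features(adj, x, k=2):
    """Simple Graph Convolution: compute S^K X. There is no nonlinearity
    between propagation steps; a plain logistic/linear model is trained
    on the smoothed features afterwards."""
    s = normalize(adj)
    for _ in range(k):
        x = [[sum(s[i][m] * x[m][j] for m in range(len(x)))
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

# Path graph 0-1-2; only node 0 carries a nonzero feature initially.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = sgc_features(adj, [[1.0], [0.0], [0.0]], k=2)
```

Because all the graph work is a fixed preprocessing step, the downstream model is cheap to train, which is the computational-efficiency advantage the abstract highlights.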

  • 10.
    Boiani, Filippo
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Blockchain Based Electronic Health Record Management For Mass Crisis Scenarios: A Feasibility Study (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Electronic Health Records (EHRs) are both crucial and sensitive, as they contain essential information and are frequently shared among different parties, including hospitals, pharmacies and private clinics. This information must remain correct, up to date, private, and accessible only to authorized people. Moreover, access must also be assured under special conditions such as mass crises, like hurricanes or earthquakes, where disruption, decentralized responses, and chaos could potentially lead to wrong procedures or even malicious behavior.

    The introduction of blockchain, a distributed ledger where records are stored in a linked sequence of blocks and are theoretically difficult to delete or tamper with, made it possible to design and implement new solutions for more failure-resistant EHR applications adopting a distributed and decentralized philosophy, in contrast with centralized ones based on cloud infrastructures or even local solutions. In this context, this work provides a systematic study to understand whether permissioned blockchain implementations could be of any benefit in managing health records in emergency situations caused by natural disasters. After the design and implementation of a basic prototype for an EHR management system in Hyperledger Fabric and the execution of a set of test cases based on a simulation of the 2010 Haiti earthquake, it was possible to discuss the benefits and tradeoffs that the system entails. The discussion focused on performance parameters like throughput, latency, memory and CPU usage.

    The system allowed patients and practitioners to share and access EHRs and to detect and react to crisis situations. Moreover, it behaved correctly in the presence of malicious nodes, with throughput and latency still worse than current centralized systems like credit card payments, but already up to two orders of magnitude better than permissionless blockchain implementations. Even though there is still a lot of work to do, the system represented by the prototype could be an interesting alternative for networks of healthcare companies to help ensure continuity of treatment while preserving privacy and confidentiality in extreme situations.
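The "linked sequence of blocks that are difficult to tamper with" can be demonstrated with a toy hash chain. This is not Hyperledger Fabric's data structure (Fabric adds endorsement, ordering, and channels on top); it only illustrates why altering an earlier record invalidates every later link.

```python
import hashlib
import json

def make_block(records, prev_hash):
    """Each block commits to its records and to the previous block's
    hash, so modifying any earlier record breaks the chain of hashes."""
    body = json.dumps({"records": records, "prev": prev_hash}, sort_keys=True)
    return {"records": records, "prev": prev_hash,
            "hash": hashlib.sha256(body.encode()).hexdigest()}

def verify(chain):
    """Recompute every hash and check every back-link."""
    for i, block in enumerate(chain):
        body = json.dumps({"records": block["records"], "prev": block["prev"]},
                          sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != block["hash"]:
            return False
        if i > 0 and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True

genesis = make_block(["patient A: admitted"], "0" * 64)
chain = [genesis, make_block(["patient A: discharged"], genesis["hash"])]
```

In a permissioned network, this tamper-evidence is combined with known validator identities, which is what keeps throughput far above permissionless chains.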

  • 11.
    Cadarso Salamanca, Manuel
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Influence of different frequencies order in a multi-step LSTM forecast for crowd movement in the domains of transportation and retail (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    This thesis presents an approach to predicting crowd movement in defined places using LSTM neural networks. Specifically, it analyses the influence that different frequencies of time series have on both the crowd forecast and the design of the architecture in the domains of transportation and retail. The architecture is also affected because a change in the frequency causes an increase or decrease in the quantity of data and, therefore, the architecture should be adapted. Previous research in the field of crowd prediction has mainly focused on anticipating the next movement of the crowd rather than estimating the number of people during a specific range of time in a particular place. These studies have used different techniques, such as Random Forests or feed-forward neural networks, to find out the influence that different frequencies have on the results of the forecast. This thesis instead applies LSTM neural networks to analyse this influence and uses specific field-related techniques to find the best parameters for forecasting future crowd movement. The results show that the order of the frequency of a time series clearly affects the outcomes of the predictions in the fields of transportation and retail, this influence being positive when the order of the frequency of the time series is able to capture the shape of the frequency of the forecast. Taking the order of the frequency into account, the forecasts for the analyzed places show an improvement of 40% in SMAPE and 50% in RMSE compared to the naive approach and other techniques. Furthermore, the results point out that there is a relation between the order of the frequency and the components of the architectures.
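The two error metrics cited above (SMAPE and RMSE) are standard and easy to state precisely; the naive baseline shown is the usual "tomorrow equals today" forecast, assumed here for illustration.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent:
    mean of |f - a| / ((|a| + |f|) / 2), skipping all-zero pairs."""
    terms = [abs(f - a) / ((abs(a) + abs(f)) / 2)
             for a, f in zip(actual, forecast) if abs(a) + abs(f) > 0]
    return 100.0 * sum(terms) / len(terms)

def rmse(actual, forecast):
    """Root mean squared error."""
    return (sum((f - a) ** 2 for a, f in zip(actual, forecast))
            / len(actual)) ** 0.5

# Naive baseline: tomorrow's visitor count equals today's.
counts = [120, 150, 90, 160]
naive_error = smape(counts[1:], counts[:-1])
```

SMAPE is scale-free (useful when comparing venues of different sizes), while RMSE penalizes large misses more heavily, which is why the thesis reports both.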

  • 12.
    Caliò, Filippo
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Advancing Keyword Clustering Techniques: A Comparative Exploration of Supervised and Unsupervised Methods: Investigating the Effectiveness and Performance of Supervised and Unsupervised Methods with Sentence Embeddings (2023). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Clustering keywords is an important Natural Language Processing task that can be adopted by several businesses, since it helps to organize and group related keywords together. By clustering keywords, businesses can better understand the topics their customers are interested in. This thesis project provides a detailed comparison of two different approaches that might be used for performing this task and aims to investigate whether having the labels associated with the keywords improves the clusters obtained. The keywords are clustered using both supervised learning, training a neural network and applying community detection algorithms such as Louvain, and unsupervised learning algorithms, such as HDBSCAN and K-Means. The evaluation is mainly based on metrics like NMI and ARI. The results show that supervised learning can produce better clusters than unsupervised learning. Looking at the NMI score, the supervised learning approach, consisting of training a neural network with a Margin Ranking Loss and applying Kruskal's algorithm, achieves a slightly better score of 0.771 against the 0.693 of the proposed unsupervised learning approach, but looking at the ARI score, the difference is more pronounced. HDBSCAN achieves a lower score of 0.112 compared to the supervised learning approach with the Margin Ranking Loss (0.296), meaning that the clusters formed by HDBSCAN may lack meaningful structure or exhibit randomness. Based on the evaluation metrics, the study demonstrates that supervised learning utilizing the Margin Ranking Loss outperforms unsupervised learning techniques in terms of cluster accuracy. However, when trained with a BCE loss function, it yields less accurate clusters (NMI: 0.473, ARI: 0.108), highlighting that the unsupervised algorithms surpass this particular supervised learning approach.

  • 13.
    Casals, Núria
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Style Transfer Paraphrasing for Consistency Training in Sentiment Classification (2021). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Text data is easy to retrieve but often expensive to classify, which is why labeled textual data is a resource often lacking in quantity. However, the use of labeled data is crucial in supervised tasks such as text classification, but semi-supervised learning algorithms have shown that the use of unlabeled data during training has the potential to improve model performance, even in comparison to a fully supervised setting. One approach to do semi-supervised learning is consistency training, in which the difference between the prediction distribution of an original unlabeled example and its augmented version is minimized. This thesis explores the performance difference between two techniques for augmenting unlabeled data used for detecting sentiment in movie reviews. The study examines whether the use of augmented data through neural style transfer paraphrasing could achieve comparable or better performance than the use of data augmented through back-translation. Five writing styles were used to generate the augmented datasets: Conversational Speech, Romantic Poetry, Shakespeare, Tweets and Bible. The results show that applying neural style transfer paraphrasing as a data augmentation technique for unlabeled examples in a semi-supervised setting does not improve the performance for sentiment classification with any of the styles used in the study. However, the use of style transferred augmented data in the semi-supervised approach generally performs better than using a model trained in a supervised scenario, where orders of magnitude more labeled data are needed and no augmentation is conducted. The study reveals that the experimented semi-supervised approach is superior to the fully supervised setting but worse than the semi-supervised approach using back-translation. 
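The consistency-training objective this abstract describes (minimizing the difference between prediction distributions for an unlabeled example and its augmented version) is commonly written as a KL-divergence term. The sketch below is illustrative; the thesis does not specify which divergence it uses, and in practice this term is added to the supervised loss on the labeled portion of the data.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(pred_original, pred_augmented):
    """Consistency training term: the model should predict the same
    sentiment distribution for an unlabeled review and its augmented
    (style-transferred or back-translated) paraphrase."""
    return kl_divergence(pred_original, pred_augmented)

# (negative, positive) probabilities for a review and its paraphrase:
loss = consistency_loss([0.9, 0.1], [0.7, 0.3])
```

The loss is zero exactly when the two prediction distributions agree, so minimizing it pushes the classifier to be invariant to the augmentation, whether that augmentation is back-translation or style transfer.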

  • 14.
    Chaliane Junior, Guilherme Dinis
    KTH, School of Information and Communication Technology (ICT).
    Churn Analysis in a Music Streaming Service: Predicting and understanding retention (2017). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Churn analysis can be understood as the problem of predicting and understanding abandonment of use of a product or service. Different industries, ranging from entertainment to financial investment and cloud providers, make use of digital platforms where their users access their product offerings. Usage often leaves behavioural trails behind. These trails can then be mined to understand users better, improve the product or service, and predict churn. In this thesis, we perform churn analysis on a real-life data set from a music streaming service, Spotify AB, with different signals ranging from activity to financial, temporal, and performance indicators. We compare logistic regression and random forests along with neural networks for the task of churn prediction, and in addition, a fourth approach combining random forests with neural networks is proposed and evaluated. Then, a meta-heuristic technique is applied over the data set to extract association rules that describe quantified relationships between predictors and churn. We relate these findings to observed patterns in aggregate-level data, finding probable explanations for how specific product features and user behaviours lead to churn or activation. For churn prediction, we found that all three non-linear methods performed better than logistic regression, suggesting the limitation of linear models for our use case. Our proposed enhanced random forest model performed mildly better than the conventional random forest.

  • 15.
    Chen, Simin
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Topic discovery and document similarity via pre-trained word embeddings (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Throughout history, humans have generated an ever-growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these vast collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus, and then compute the distance between two document vectors as the document similarity. In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate these discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power. Our model is efficient in terms of average time complexity. The nBTE representation is also interpretable, as it allows for topic discovery in the document. On three labeled public data sets, our model achieved comparable k-nearest neighbor classification accuracy with five state-of-the-art baseline models. Furthermore, from these three data sets we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results answer the research question of this thesis: can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate document similarity?
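The soft cosine similarity used between nBTE vectors can be sketched directly from its definition. The topic-similarity matrix and the toy vectors below are illustrative; in the thesis, topic similarities would come from the distances between clustered word embeddings.

```python
def soft_cosine(a, b, sim):
    """Soft cosine similarity between two normalized bag-of-topic
    vectors: the topic-similarity matrix `sim` lets related (not only
    identical) topics contribute to the score, unlike ordinary cosine."""
    def dot(x, y):
        return sum(x[i] * sim[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))

# Two documents about disjoint but related topics:
sim = [[1.0, 0.5], [0.5, 1.0]]
score = soft_cosine([1.0, 0.0], [0.0, 1.0], sim)  # 0.5
```

With the identity matrix as `sim`, the measure reduces to ordinary cosine similarity, so disjoint topic vectors would score zero; the off-diagonal entries are what let "related but different" topics register as similar.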

  • 16.
    Cui, Zhexin
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Classification of Financial Transactions using Lightweight Memory Networks2022Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Various forms of fraud have substantially impacted our lives and caused considerable losses to some people. To reduce these losses, many researchers have devoted themselves to the study of fraud detection. As fraud detection has developed from expert-driven to data-driven systems, its scalability and accuracy have improved considerably. However, most existing fraud detection methods focus on the feature extraction and classification of a single transaction, ignoring the long-term temporal and spatial information from accounts. In this work, we propose to address these limitations by employing a lightweight memory network (LiMNet), a deep neural network that captures causal relations between temporal interactions. We evaluate our approach on two data sets: the Ether-Fraud dataset and the Elliptic dataset. The former is a brand-new dataset collected from Etherscan with data mining, and the latter is published by the company of the same name. As a set of raw data never used before, the Ether-Fraud dataset had some issues, such as huge variation among values and incomplete information. We therefore processed Ether-Fraud with data supplementation and normalization, which solved these problems. A series of experiments was designed based on our analysis of the model and helped us find the best hyper-parameter setting. We then compared the performance of the model with other baselines; the results showed that LiMNet outperformed traditional algorithms on the Ether-Fraud dataset but was not as good as the graph-based method on the Elliptic dataset. Finally, we summarize the experience of applying the model to fraud detection, the strengths and weaknesses of the model, and future directions for improvement.
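    The abstract does not specify the normalization scheme used on Ether-Fraud; a common way to tame huge variation among transaction values, shown here purely as a hypothetical sketch, is log-scaling followed by min-max normalization:

    ```python
    import numpy as np

    def normalize_amounts(amounts):
        # Hypothetical sketch: log1p compresses the huge value range,
        # then min-max rescales the result into [0, 1].
        logged = np.log1p(np.asarray(amounts, dtype=float))
        lo, hi = logged.min(), logged.max()
        return (logged - lo) / (hi - lo)

    x = normalize_amounts([1, 10, 100, 1000])
    print(x.min(), x.max())  # 0.0 1.0
    ```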

    Download full text (pdf)
    fulltext
  • 17.
    Dallagiacoma, Marco
    KTH, School of Information and Communication Technology (ICT).
    Predicting the risk of accidents for downhill skiers2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    In recent years, the need for insurance coverage for downhill skiers has become increasingly important. The goal of this thesis work is to enable the development of innovative insurance services for skiers. Specifically, this project addresses the problem of estimating the probability of a skier suffering injuries while skiing. This problem is addressed by developing and evaluating a number of machine-learning models. The models are trained on data that is commonly available to ski resorts, namely the history of accesses to ski lifts, reports of accidents collected by ski patrols, and weather-related information retrieved from publicly accessible weather stations. Both personal information about skiers and environmental variables are considered to estimate the risk. Additionally, an auxiliary model is developed to estimate the condition of the snow in a ski resort from past weather data. A number of techniques to deal with the problems related to this task, such as class imbalance and the calibration of probabilities, are evaluated and compared. The main contribution of this project is the implementation of machine-learning models to predict the probability of accidents for downhill skiers. The obtained models achieve satisfactory performance at estimating the risk of accidents for skiers, provided that the needed historical data for the target ski resorts is available. The biggest limitation encountered by this study relates to the relatively low volume and quality of the available data, which suggests that there are opportunities for further enhancements if additional (and especially better) data is collected.
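    The specific calibration techniques evaluated in the thesis are not named in the abstract, but calibration itself can be checked with simple reliability binning: compare the mean predicted risk against the observed accident frequency in each probability bin. The data below is synthetic:

    ```python
    import numpy as np

    def reliability_bins(probs, labels, n_bins=5):
        # For each probability bin, compare the mean predicted risk with
        # the observed fraction of positives; a well-calibrated model has
        # the two close in every bin.
        probs, labels = np.asarray(probs), np.asarray(labels)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        out = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (probs >= lo) & (probs <= hi if hi == 1.0 else probs < hi)
            if mask.any():
                out.append((probs[mask].mean(), labels[mask].mean()))
        return out

    # Perfectly calibrated toy data: predicted 0.1 with 10% positives,
    # predicted 0.9 with 90% positives.
    bins = reliability_bins([0.1] * 10 + [0.9] * 10,
                            [1] + [0] * 9 + [1] * 9 + [0])
    print(bins)
    ```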

    Download full text (pdf)
    fulltext
  • 18.
    Dushi, Denis
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Using Deep Learning to Answer Visual Questions from Blind People2019Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    A natural application of artificial intelligence is to help blind people overcome their daily visual challenges through AI-based assistive technologies. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about this image, and must predict the correct answer. Recently, the VizWiz dataset was introduced: a collection of images and questions originating from blind people. Being the first VQA dataset derived from a natural setting, VizWiz presents many limitations and peculiarities. More specifically, the characteristics observed are the high uncertainty of the answers, the conversational aspect of the questions, the relatively small size of the dataset and, ultimately, the imbalance between answerable and unanswerable classes. These characteristics can be observed, individually or jointly, in other VQA datasets as well, resulting in a burden when solving the VQA task. Data science pre-processing techniques are particularly suitable for addressing these aspects of the data. Therefore, to provide a solid contribution to the VQA task, we answered the research question “Can data science pre-processing techniques improve the VQA task?” by proposing and studying the effects of four different pre-processing techniques. To address the high uncertainty of answers, we employed a pre-processing step in which the uncertainty of each answer is computed and used to weight the soft scores of our model during training. The adoption of this “uncertainty-aware” training procedure boosted the predictive accuracy of our model by 10%, providing a new state of the art when evaluated on the test split of the VizWiz dataset. To overcome the limited amount of data, we designed and tested a new pre-processing procedure able to augment the training set and almost double its data points by computing the cosine similarity between answer representations.
We also addressed the conversational aspect of questions collected from real-world verbal conversations by proposing an alternative question pre-processing pipeline in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the standard question-processing pipeline, we were able to achieve 0.527 predictive accuracy with the new pipeline. Ultimately, we addressed the imbalance between answerable and unanswerable classes when predicting the answerability of a visual question. We tested two standard pre-processing techniques to adjust the dataset class distribution: oversampling and undersampling. Oversampling provided a small improvement in both average precision and F1 score.
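    The uncertainty-aware weighting can be illustrated with a sketch; the entropy-based uncertainty measure and the VQA-style soft-score formula below are illustrative assumptions, not necessarily the exact definitions used in the thesis:

    ```python
    import numpy as np

    def answer_uncertainty(counts):
        # Entropy of the annotators' answer distribution: low when they
        # agree, high when they split (hypothetical choice of measure).
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return -(p * np.log(p + 1e-12)).sum()

    def weighted_soft_scores(counts):
        # VQA-style soft scores (min(votes/3, 1)), down-weighted for
        # questions whose answers the annotators disagreed on.
        counts = np.asarray(counts, dtype=float)
        scores = np.minimum(counts / 3.0, 1.0)
        weight = 1.0 / (1.0 + answer_uncertainty(counts))
        return weight * scores

    print(weighted_soft_scores([10, 0, 0]))  # unanimous: near-full weight
    print(weighted_soft_scores([4, 3, 3]))   # split answers: scores shrunk
    ```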

    Download full text (pdf)
    fulltext
  • 19.
    Díaz González, Fernando
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Federated Learning for Time Series Forecasting Using LSTM Networks: Exploiting Similarities Through Clustering2019Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Federated learning poses a statistical challenge when training on highly heterogeneous sequence data. For example, time-series telecom data collected over long intervals regularly shows mixed fluctuations and patterns. These distinct distributions are an inconvenience when a node not only plans to contribute to the creation of the global model but also plans to apply it to its local dataset. In this scenario, adopting a one-size-fits-all approach might be inadequate, even when using state-of-the-art machine learning techniques for time series forecasting, such as Long Short-Term Memory (LSTM) networks, which have proven able to capture many idiosyncrasies and generalise to new patterns. In this work, we show that clustering the clients by these patterns and selectively aggregating their updates into different global models can improve local performance with minimal overhead, as we demonstrate through experiments using real-world time series datasets and a basic LSTM model.
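    The core idea, clustering clients and averaging updates only within each cluster rather than globally, can be sketched as follows; the feature vectors, the plain k-means step and all names are hypothetical simplifications of the method:

    ```python
    import numpy as np

    def cluster_and_aggregate(client_updates, client_features, n_clusters=2, seed=0):
        # Group clients by a feature vector describing their time series
        # (e.g. seasonality statistics), then federated-average model
        # updates per cluster instead of into one global model.
        rng = np.random.default_rng(seed)
        feats = np.asarray(client_features, dtype=float)
        centers = feats[rng.choice(len(feats), n_clusters, replace=False)]
        for _ in range(10):  # plain k-means on the client features
            assign = np.argmin(((feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for k in range(n_clusters):
                if (assign == k).any():
                    centers[k] = feats[assign == k].mean(axis=0)
        return {k: np.mean([client_updates[i] for i in range(len(assign))
                            if assign[i] == k], axis=0)
                for k in range(n_clusters) if (assign == k).any()}

    updates = [np.array([1.0, 1.0]), np.array([1.2, 0.8]), np.array([5.0, 5.0])]
    feats = [[0.0], [0.1], [10.0]]  # two similar clients, one outlier
    models = cluster_and_aggregate(updates, feats)
    print(len(models))  # 2 cluster-specific models
    ```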

    Download full text (pdf)
    fulltext
  • 20.
    Elango, Veeresh
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Change Point Detection in Sequential Sensor Data using Recurrent Neural Networks2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Change-point detection is the problem of recognizing abrupt variations in sequential data. It covers a wide range of real-world problems in the medical, meteorological and automotive industries, and has been actively addressed in the statistics and data mining communities. In the automotive industry, sequential data is collected from various components of the vehicles. Changes in the underlying distribution of the sequential data might indicate component failure, sensor degradation or different vehicle activity, which explains the need for detecting these deviations in this industry. The research question of this thesis focuses on how different architectures of the recurrent neural network (RNN) perform in detecting the change points of sequential sensor data. In this thesis, the sliding-window method was utilised to turn variable-length sequences into fixed-length ones. These fixed-length sequences were then provided to many-input single-output (MISO) and many-input many-output (MIMO) RNN architectures to perform two different tasks: sequence detection, where the position of the change point in the sequence is recognized, and sequence classification, where the sequence is checked for the presence of a change point. The stacking ensemble technique was employed to combine the results of sequence classification with sequence detection to further enhance the performance. The results of the thesis show that the MIMO architecture has higher precision than recall, whereas the MISO architecture has higher recall than precision, with both achieving similar F1-scores. The ensemble technique exhibits a boost in the performance of both architectures.
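    The sliding-window step that converts variable-length sequences into fixed-length RNN inputs can be sketched as:

    ```python
    import numpy as np

    def sliding_windows(seq, width, step=1):
        # Slice a variable-length sequence into overlapping fixed-length
        # windows, the representation fed to the MISO/MIMO RNNs.
        seq = np.asarray(seq)
        return np.stack([seq[i:i + width]
                         for i in range(0, len(seq) - width + 1, step)])

    windows = sliding_windows([0, 0, 0, 1, 1, 1], width=4, step=1)
    print(windows.shape)  # (3, 4)
    # Every window here straddles the 0 -> 1 jump, which is exactly what
    # the change-point detector must flag (and locate, for detection).
    ```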

    Download full text (pdf)
    fulltext
  • 21.
    Fallahtoori, Sahar
    KTH, School of Information and Communication Technology (ICT).
    Distributed Graph Clustering: Study of DiDiC and Some Simpler Forms 2015Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    The size of global electronic data in need of storage and retrieval is growing at an increasing rate. As a result of this growth, the development of technologies to process such data is a necessity. The data is growing in both complexity and connectivity, particularly for social networks. Connectivity of data means that the records to be stored are highly interdependent. Conventional relational databases are poorly suited for processing highly connected data. Graph databases, on the contrary, are inherently suited for such dependencies. This is mainly because graph databases try to preserve locality and store adjacent records close to one another, which allows retrieval of adjacent elements in constant time, regardless of graph size. Furthermore, with the everyday growth of data volume, these databases will no longer fit on a single server and need more (distributed) resources. Thus, partitioning of the data to be stored is required. Graph partitioning, based on its aim, can be divided into two major subcategories: a) balanced partitioning, where the aim is to find a predefined number, N, of equally sized clusters, and b) community detection, where the aim is to find all underlying dense subgraphs. In this thesis we investigate and improve one particular graph partitioning algorithm, namely DiDiC, which is designed for balanced partitioning. DiDiC is short for diffusive and distributed graph partitioning. The algorithm was independently implemented for this thesis. The main testbeds of our work are real-world social network graphs such as Wikipedia or Facebook, and synthetically generated graphs. DiDiC's various aspects and performance are further examined in different situations (i.e. types of graph) and with various sets of parameters (i.e. DiDiC hyperparameters). Our experiments indicate that DiDiC fails to partition the input graphs into the desired number of clusters in more than 70% of cases.
In most failure cases it returns the whole graph as a single cluster. We noticed that the diffusive aspect of DiDiC is minimally constrained. Inspired by these observations, we propose two diffusive variants of DiDiC to address this issue and consequently improve the success rate, mainly by constraining the diffusive aspect of DiDiC. The modifications are straightforward to implement and can be easily incorporated into existing graph databases. We show that our modifications consistently outperform DiDiC by a margin of ~30% in success rate across various scenarios, which include various sizes of graphs with different connectivity and structure of the underlying clusters. Furthermore, we demonstrate the effectiveness of DiDiC in discovering underlying high-density regions of a graph, a.k.a. “community detection”. In fact, we show that it is more successful at community detection (60% success rate) than at balanced clustering (35% success rate). Finally, we investigate the importance of the random initialization of the DiDiC algorithm. We observe that, while multiple random initializations (keeping the best-performing one) can help final performance, there are diminishing returns beyond 4 initializations.
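    DiDiC's diffusive core can be illustrated with a single, heavily simplified diffusion step; the real algorithm runs nested diffusion systems per cluster, so this sketch only shows how "load" spreads along edges while being conserved:

    ```python
    import numpy as np

    def diffusion_step(load, adjacency, alpha=0.25):
        # One diffusion step: each node exchanges a fraction alpha of the
        # load difference with every neighbour; total load is conserved.
        # DiDiC runs one such system per cluster and assigns each node to
        # the cluster with the highest load at that node.
        A = np.asarray(adjacency, dtype=float)
        deg = A.sum(axis=1)
        return load + alpha * (A @ load - deg * load)

    # Path graph 0-1-2 with all load on node 0: load spreads until uniform.
    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
    load = np.array([1.0, 0.0, 0.0])
    for _ in range(50):
        load = diffusion_step(load, A)
    print(load.round(2))  # [0.33 0.33 0.33]
    ```

    Constraining how far this diffusion spreads is, roughly, the kind of modification the thesis proposes to improve DiDiC's success rate.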

    Download full text (pdf)
    fulltext
  • 22.
    Fu, Xinye
    KTH, School of Information and Communication Technology (ICT).
    Building Evolutionary Clustering Algorithms on Spark2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Evolutionary clustering (EC) is a family of clustering algorithms for handling noise in time-evolving data. By considering history, it can track the true drift of clusters across time. EC tries to make the clustering result fit both the current data and the historical data/model well, so each EC algorithm defines a snapshot cost (SC) and a temporal cost (TC) to reflect both requirements. EC algorithms minimize SC and TC by different methods, and they differ in their ability to deal with varying numbers of clusters, the addition/deletion of nodes, etc. To date, there are more than 10 EC algorithms, but no survey of them; therefore, a survey of EC is presented in this thesis. The survey first introduces the application scenarios of EC, the definition of EC, and the history of EC algorithms. Then the two categories of EC algorithms, model-level algorithms and data-level algorithms, are introduced one by one, and the algorithms are compared with one another. Finally, a performance prediction of the algorithms is given: algorithms that optimize the whole problem (i.e., optimize the change parameter or do not rely on a change parameter for control) and accept changes in cluster number should perform best in theory. EC algorithms typically process large datasets and include many iterative, data-intensive computations, so they are well suited for implementation on Spark. Until now, there has been no implementation of an EC algorithm on Spark; hence, four EC algorithms are implemented on Spark in this project. The thesis covers three aspects of the implementation. Firstly, algorithms that parallelize well and have wide applicability were selected for implementation. Secondly, the program design details of each algorithm are described. Finally, the implementations are verified by correctness and efficiency experiments.
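    The SC/TC trade-off can be written as a single objective, total cost = SC + cp · TC, where cp is the change parameter; the k-means-style costs below are one common instantiation, not necessarily the one each surveyed algorithm uses:

    ```python
    import numpy as np

    def ec_cost(data, centers, prev_centers, cp=0.5):
        # Snapshot cost: how well the centers fit the current data.
        # Temporal cost: how far the centers drifted since the previous
        # timestep. cp trades off fit against temporal smoothness.
        data = np.asarray(data, dtype=float)
        centers = np.asarray(centers, dtype=float)
        prev = np.asarray(prev_centers, dtype=float)
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        sc = d2.min(axis=1).sum()           # snapshot cost
        tc = ((centers - prev) ** 2).sum()  # temporal cost
        return sc + cp * tc

    data = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
    prev = [[0.0, 0.0], [5.0, 5.0]]
    # Moving a center only slightly keeps TC small at a modest SC price.
    print(ec_cost(data, [[0.1, 0.0], [5.0, 5.0]], prev, cp=0.5))
    ```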

    Download full text (pdf)
    fulltext
  • 23.
    Garcia Bernal, Daniel
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Decentralizing Large-Scale Natural Language Processing with Federated Learning2020Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Natural Language Processing (NLP) has been one of the most popular and visible forms of Artificial Intelligence in recent years, partly because it deals with a common characteristic of human beings: language. NLP applications enable new services in the industrial sector, offering new solutions and providing significant productivity gains. All of this has happened thanks to the rapid progress of Deep Learning models. Large-scale contextual representation models, such as Word2Vec, ELMo and BERT, have significantly advanced NLP in recent years. With these latest NLP models, it is possible to understand the semantics of text to a degree never seen before. However, they require large amounts of text data to achieve high-quality results. This data can be gathered from different sources, but one of the main collection points are devices such as smartphones, smart appliances and smart sensors. Unfortunately, joining and accessing all this data from multiple sources is extremely challenging due to privacy and regulatory reasons. New protocols and techniques have been developed to overcome this limitation by training models in a massively distributed manner, taking advantage of the computational power of the devices that generate the data. In particular, this research aims to test the viability of training NLP models, specifically Word2Vec, with a massively distributed protocol like Federated Learning. The results show that Federated Word2Vec works as well as Word2Vec in most of the scenarios, even surpassing it in some semantic benchmark tasks. This is a novel area of research in which few studies have been conducted, with a large knowledge gap to fill in future research.

    Download full text (pdf)
    fulltext
  • 24.
    Ghandeharioon, Cosar
    KTH, School of Electrical Engineering and Computer Science (EECS).
    An evaluation of deep neural network approaches for traffic speed prediction2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    The transportation industry has a significant effect on the sustainability and development of a society. Learning traffic patterns and predicting traffic parameters such as flow or speed for a specific spatiotemporal point is beneficial for transportation systems. For instance, intelligent transportation systems (ITS) can use forecasted results to improve services such as driver assistance systems. Furthermore, such predictions can facilitate urban planning by making management decisions data-driven.

    There are several prediction models for time series regression on traffic data to predict the average speed for different forecasting horizons. In this thesis work, we evaluated Long Short-Term Memory (LSTM), a recurrent neural network model, and Neural Decomposition (ND), a neural network that performs Fourier-like decomposition. The results were compared with the ARIMA model, with the persistent model chosen as a baseline for the evaluation task. We proposed two new criteria, in addition to RMSE and r2, for evaluating models that forecast highly variable velocity changes. The dataset was gathered from highway traffic sensors around the E4 in Stockholm, taken from the “Motorway Control System” (MCS) operated by Trafikverket.

    Our experiments show that none of the models could predict the highly variable velocity changes at the exact times they happened. The reason is that the adjacent local area showed no indication of sudden changes in the average speed of vehicles passing the selected sensor. We also conclude that the traditional ML metrics of RMSE and r2 could usefully be augmented with domain-specific measures.
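    The persistent baseline and RMSE mentioned above can be sketched as follows; the toy speed series is invented to show why a sudden slowdown dominates the error:

    ```python
    import numpy as np

    def persistence_forecast(series, horizon=1):
        # The persistent (naive) baseline: predict that the last observed
        # speed simply repeats `horizon` steps ahead.
        series = np.asarray(series, dtype=float)
        return series[:-horizon]

    def rmse(y_true, y_pred):
        return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

    speeds = [90.0, 88.0, 85.0, 60.0, 62.0]  # a sudden slowdown at t=3
    pred = persistence_forecast(speeds)       # forecasts for t=1..4
    print(rmse(speeds[1:], pred))
    ```

    Almost all of the error comes from the single abrupt drop, illustrating why standard aggregate metrics say little about a model's ability to catch such events.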

    Download full text (pdf)
    fulltext
  • 25.
    Giaretta, Lodovico
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Pushing the Limits of Gossip-Based Decentralised Machine Learning2019Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Recent years have seen a sharp increase in the ubiquity and power of connected devices, such as smartphones, smart appliances and smart sensors. These devices produce large amounts of data that can be extremely precious for training larger, more advanced machine learning models. Unfortunately, it is sometimes not possible to collect and process these datasets on a central system, due either to their size or to the growing privacy requirements of digital data handling. To overcome this limit, researchers have developed protocols to train global models in a decentralised fashion, exploiting the computational power of these edge devices. These protocols do not require any of the data on the device to be shared, relying instead on communicating partially-trained models. Unfortunately, real-world systems are notoriously hard to control, and may present a wide range of challenges that are easily overlooked in academic studies and simulations. This research analyses the gossip learning protocol, one of the main results in the area of decentralised machine learning, to assess its applicability to real-world scenarios. Specifically, this work identifies the main assumptions built into the protocol, and performs carefully-crafted simulations in order to test its behaviour when these assumptions are lifted. The results show that the protocol can already be applied to certain environments, but that it fails when exposed to certain conditions that appear in some real-world scenarios. In particular, the models trained by the protocol may be biased towards the data stored in nodes with faster communication speeds or a higher number of neighbours. Furthermore, certain communication topologies can have a strong negative impact on the convergence speed of the models. While this study also suggests effective mitigations for some of these issues, it appears that the gossip learning protocol requires further research efforts in order to ensure a wider industrial applicability.
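    The gossip learning protocol's core loop, each node merging a received model with its own before training on local data, can be sketched roughly as follows; the age-weighted average is one common merge rule, used here as an assumption rather than the protocol's exact definition:

    ```python
    import numpy as np

    def gossip_merge(weights, age, recv_weights, recv_age):
        # Merge the received model with the local one, weighted by model
        # age (how much training each has seen), then continue locally.
        total = age + recv_age
        merged = (age * weights + recv_age * recv_weights) / max(total, 1)
        return merged, total

    w, a = np.array([0.0, 0.0]), 1
    w, a = gossip_merge(w, a, np.array([2.0, 2.0]), 1)
    print(w, a)  # [1. 1.] 2
    ```

    The biases the study found follow from this loop: nodes that send more often, or have more neighbours, contribute their updates to more merges.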

    Download full text (pdf)
    fulltext
  • 26.
    Grana Gutiérrez, Braulio
    KTH, School of Information and Communication Technology (ICT).
    Dataset versioning for Hops File System: Snapshotting solution for reliable and reproducible data science experiments2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    As awareness of the potential of Big Data grows, more and more companies are starting to create their own Data Science divisions, and their projects are becoming large, complex efforts handled by big multidisciplinary teams. Furthermore, with the expansion of fields such as Deep Learning, Data Science is becoming a very popular research field in both companies and universities.

    In this context it becomes crucial for Data Scientists to be able to reproduce their experiments and test them against models developed on previous versions of a dataset. This Master Thesis project presents the design and implementation of a snapshotting system for the distributed file system HopsFS, based on Apache HDFS and developed at the Swedish Institute of Computer Science (SICS).

    This project improves on previous solutions designed for both HopsFS and HDFS by solving problems such as the handling of incomplete blocks in snapshots, while also adding new features such as automatic snapshots that allow users to undo the last few changes made to a file.

    Finally, an analysis of the implementation was performed in order to compare it to the previous state of HopsFS and quantify the impact of the solution on the different operations performed by the system. This analysis showed an increase of around 40% in the time needed to perform operations such as read and write under different workloads, due mostly to the new database queries used in this solution.

    Download full text (pdf)
    fulltext
  • 27.
    Heyder, Jakob Wendelin
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Knowledge Base Augmentation from Spreadsheet Data: Combining layout inference with multimodal candidate classification2020Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Spreadsheets constitute a valuable and notably large set of documents within many enterprise organizations and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionalities, extraction and transformation of their data remain a cumbersome and mostly manual task. The great flexibility they provide to the user results in data that is arbitrarily structured and hard to process for other applications. In this thesis, we propose a novel architecture that combines supervised layout inference and multimodal candidate classification to allow knowledge base augmentation from arbitrary spreadsheets. In our design, we consider the need for repairing misclassifications and allow for verification and ranking of ambiguous candidates. We evaluate the performance of our system on two datasets, one with single-table spreadsheets and another with spreadsheets of arbitrary format. The evaluation results show that the proposed system achieves performance on single-table spreadsheets similar to state-of-the-art rule-based solutions. Additionally, the flexibility of the system allows us to process arbitrary spreadsheet formats, including horizontally and vertically aligned tables, multiple worksheets, and contextualizing metadata, which was not possible with existing purely text-based or table-based solutions. The experiments demonstrate that it can achieve high effectiveness, with an F1 score of 95.71 on arbitrary spreadsheets that require the interpretation of surrounding metadata. The precision of the system can be further increased by applying candidate schema-matching based on the semantic similarity of column headers.

    Download full text (pdf)
    fulltext
  • 28.
    Kaas Johansen, Andreas
    KTH, School of Information and Communication Technology (ICT).
    Exploring consensus mediating arguments in online debates2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    This work presents a first venture into the search for features that define the rhetorical strategy known as Rogerian rhetoric. Rogerian rhetoric is a conflict-solving rhetorical strategy intended to find common ground instead of polarizing debates further by presenting strong arguments and counter-arguments, as is often done in debates. The goal of the thesis is to lay the groundwork, through feature exploration and an evaluation of machine learning in this domain, for others tempted to model consensus-mediating arguments. In order to evaluate different sets of features, statistical testing is applied to check whether the distribution of certain features differs between consensus-mediating and non-consensus-mediating comments. Machine learning in this domain is evaluated using support vector machines and different feature sets. The results show that, on this data, consensus-mediating comments do have some characteristics that differ from other comments, some of which may generalize across debates. Next, as consensus-mediating arguments proved to be rare, these comments form a minority class; to classify them using machine learning techniques, overfitting needs to be addressed, and the results suggest that the strategy applied to deal with overfitting is highly important. Due to the bias inherent in the hand-annotated dataset, the results should be considered provisional; more studies using debates from more domains, with either expert or crowdsourced annotations, are necessary to take the research further and produce results that generalize well.

    Download full text (pdf)
    fulltext
  • 29.
    Khelghatdoust, Mansour
    KTH, School of Electrical Engineering (EES), Communication Networks.
    Gossip based peer sampling in social overlays2014Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Performance of many P2P systems depends on the ability to construct a random overlay network among the nodes. Current state-of-the-art techniques for constructing random overlays have an implicit requirement that any two nodes in the system should always be able to communicate and establish a link between them. However, this is not the case in some of the environments where distributed systems are required to be deployed, e.g., Decentralized Online Social Networks, wireless networks, or networks with limited connectivity because of NATs/firewalls. In such restricted networks, every node is able to communicate with only a predefined set of nodes and thus, the existing solutions for constructing random overlays are not applicable. In this thesis we propose a gossip-based peer sampling service capable of running on top of such restricted networks and producing an on-the-fly random overlay. The service provides every participating node with a set of uniformly random nodes from the network, as well as efficient routing paths for reaching those nodes via the restricted network. We perform extensive experiments on four real-world networks and show that the resulting overlays rapidly converge to random overlays. The results also show that the constructed random overlays have self-healing behaviour under churn and catastrophic failures.

    Download full text (pdf)
    fulltext
  • 30.
    Knoors, Daan
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Utility of Differentially Private Synthetic Data Generation for High-Dimensional Databases2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    When processing data that contains sensitive information, careful consideration is required with regard to privacy preservation, to prevent disclosure of confidential information. Privacy engineering enables one to extract valuable patterns safely, without compromising anyone’s privacy. Over the last decade, academics have actively sought stronger definitions and methodologies for achieving data privacy while preserving data utility. Differential privacy emerged as the de facto standard for achieving data privacy, and numerous techniques based on this definition are continuously proposed. One method in particular focuses on the generation of private synthetic databases that mimic the statistical patterns and characteristics of a confidential data source in a privacy-preserving manner. The original data format and utility are preserved in a new database that can be shared and analyzed safely without the risk of privacy violation. However, while this privacy approach sounds promising, there has been little application beyond academic research. Hence, we investigate the potential of private synthetic data generation for real-world applicability. We propose a new utility evaluation framework that provides a unified approach upon which various algorithms can be assessed and compared. This framework extends academic evaluation methods by incorporating a user-oriented perspective and varying industry requirements, while also examining performance on real-world use cases. Finally, we implement multiple general-purpose algorithms and evaluate them based on our framework to ultimately determine the potential of private synthetic data generation beyond the academic domain.
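    Differentially private synthetic data generators are typically built from noisy statistics; the basic building block, the Laplace mechanism applied to a counting query, can be sketched as follows (the function name and toy numbers are illustrative):

    ```python
    import numpy as np

    def laplace_count(true_count, epsilon, rng):
        # A counting query has sensitivity 1, so adding noise drawn from
        # Laplace(scale = 1/epsilon) makes the released count
        # epsilon-differentially private.
        return true_count + rng.laplace(scale=1.0 / epsilon)

    rng = np.random.default_rng(0)
    noisy = [laplace_count(120, epsilon=1.0, rng=rng) for _ in range(1000)]
    print(round(float(np.mean(noisy))))  # noise is zero-mean, so ~120
    ```

    Synthetic-data generators repeat this idea over many marginal statistics of the confidential table and then sample records consistent with the noisy marginals.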

    Download full text (pdf)
    fulltext
  • 32.
    Kowalczewski, Jakub
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Normalized conformal prediction for time series data2019Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Every forecast is valid only if proper prediction intervals are stated. Current models focus mainly on point forecasts and neglect prediction intervals. Typically, a single estimate of the model’s error is applied to every prediction in the same way, although every case is different and a different error measure should be applied to each instance. One state-of-the-art technique that can address this behaviour is conformal prediction, in its normalized variant. In this thesis we apply this technique to time series problems. Special focus is put on estimating the difficulty of every instance using the error of neighbouring instances. This thesis describes the entire process of adapting time series data to the normalized conformal prediction framework, and a comparison with other techniques is made. The final results do not show that the aforementioned method is superior to existing techniques: in various setups, different methods performed best. However, it is similar in terms of performance and is therefore an interesting addition to the data science forecasting toolkit.
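    The normalized variant described above can be sketched as follows. This is a minimal illustration, not the thesis’s implementation: the function name and the β smoothing term are our own, and the difficulty could be, as in the thesis, the mean error of an instance’s nearest neighbours.

```python
import math

def normalized_interval(cal_residuals, cal_difficulties, pred, difficulty,
                        significance=0.1, beta=0.1):
    """Prediction interval from normalized nonconformity scores.

    cal_residuals: |y - y_hat| on a held-out calibration set
    cal_difficulties: per-instance difficulty estimates, e.g. the mean
        absolute error of each instance's nearest neighbours
    beta: smoothing term so easy instances still get a non-zero width
    """
    scores = sorted(r / (d + beta) for r, d in zip(cal_residuals, cal_difficulties))
    # index of the (1 - significance) empirical quantile of calibration scores
    k = min(len(scores) - 1, math.ceil((1 - significance) * (len(scores) + 1)) - 1)
    half_width = scores[k] * (difficulty + beta)
    return pred - half_width, pred + half_width
```

    The key property is that a harder instance (larger difficulty estimate) receives a proportionally wider interval, instead of one global error margin for all predictions.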

    Download full text (pdf)
    fulltext
  • 33.
    Labroski, Aleksandar
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Multi-view versus single-view machine learning for disease diagnosis in primary healthcare2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    The work presented in this report considers and compares two different approaches of machine learning towards solving the problem of disease diagnosis prediction in primary healthcare: single-view and multi-view machine learning. In particular, the problem of disease diagnosis prediction refers to predicting a (possible) diagnosis for a given patient based on her past medical history. The problem area is extensive, especially considering that there are over 14,400 unique possible diagnoses (grouped into 22 high-level categories) that can be considered as prediction targets. The approach taken in this work considers the high-level categories as prediction targets and attempts to use the two different machine learning techniques to get close to an optimal solution. The multi-view machine learning paradigm was chosen as an approach that can improve the predictive performance of classifiers in settings with multiple heterogeneous data sources (different views of the same data), which is exactly the case here. In order to compare the single-view and multi-view machine learning paradigms (based on the concept of supervised learning), several different experiments are devised which explore the possible solution space under each paradigm. The work closely touches on other machine learning concepts such as ensemble learning, stacked generalization and dimensionality reduction-based learning. As we shall see, the results show that multi-view stacked generalization is a powerful paradigm that can significantly improve predictive performance in a supervised learning setting. The different models’ performance was evaluated using F1 scores, and we observed an average increase of 0.04 and a maximum increase of 0.114 F1 score points.
    The findings also show that the approach of multi-view stacked ensemble learning is particularly well suited as a noise reduction technique and works well in cases where the feature data is expected to contain a notable amount of noise. This can be very beneficial and of interest to projects where the features are not manually chosen by domain experts.

    Download full text (pdf)
    fulltext
  • 34.
    Lee, Zed Heeje
    KTH, School of Electrical Engineering and Computer Science (EECS).
    A graph representation of event intervals for efficient clustering and classification2020Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Sequences of event intervals occur in several application domains, while their inherent complexity hinders scalable solutions to tasks such as clustering and classification. In this thesis, we propose a novel spectral embedding representation of event interval sequences that relies on bipartite graphs. More concretely, each event interval sequence is represented by a bipartite graph by following three main steps: (1) creating a hash table that can quickly convert a collection of event interval sequences into a bipartite graph representation, (2) creating and regularizing a bi-adjacency matrix corresponding to the bipartite graph, (3) defining a spectral embedding mapping on the bi-adjacency matrix. In addition, we show that substantial improvements can be achieved with regard to classification performance through pruning parameters that capture the nature of the relations formed by the event intervals. We demonstrate through extensive experimental evaluation on five real-world datasets that our approach can obtain runtime speedups of up to two orders of magnitude compared to other state-of-the-art methods and similar or better clustering and classification performance.
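    Steps (1) and (2) can be sketched as follows. This is our own minimal illustration, assuming sequences of (label, start, end) triples and weighting each event by its interval duration (the thesis’s exact weighting and regularization are not reproduced); step (3) would then apply a spectral decomposition (e.g. an SVD) to the resulting matrix.

```python
from collections import defaultdict

def biadjacency(sequences):
    """Bipartite sequence-by-event count matrix from event interval sequences.

    sequences: list of sequences, each a list of (label, start, end) tuples.
    Returns the dense bi-adjacency matrix and the label -> column hash table.
    """
    col = {}          # step 1: hash table mapping event labels to columns
    rows = []
    for seq in sequences:
        counts = defaultdict(float)
        for label, start, end in seq:
            if label not in col:
                col[label] = len(col)
            counts[col[label]] += end - start   # weight by interval duration
        rows.append(counts)
    # step 2: dense bi-adjacency matrix, one row per sequence
    matrix = [[r.get(j, 0) for j in range(len(col))] for r in rows]
    return matrix, col
```

    Each sequence becomes one row, so a subsequent spectral embedding of this matrix yields one fixed-length vector per sequence, suitable for standard clustering and classification algorithms.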

    Download full text (pdf)
    fulltext
  • 35.
    Leonard, Stutzer
    KTH, School of Electrical Engineering and Computer Science (EECS).
    State Validation of Ethash-based Blockchains using a zk-SNARK-based Chain Relay2022Independent thesis Advanced level (degree of Master of Fine Arts (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    We present an Ethash-based blockchain relay that utilizes Off-Chain Computation (OCC) to validate block headers on-chain. Current works compromise on fundamental ideas of the blockchain concept: they either require a centralized entity or a centralized Trusted Third Party (TTP), or are built on economic assumptions. In that way, they try to circumvent the cost-heavy on-chain Ethash computation. We utilize Zero Knowledge Proofs (ZKPs) to outsource the Ethash validation to an Off-Chain Computation Framework (OCCF) and only verify the validity of the OCC on-chain. The dataset required for the Ethash validation is inserted into a Merkle tree for computational feasibility. Additionally, we validate multiple block headers in batches to further minimize on-chain costs. The on-chain costs of our batch validation mechanism are minimal and constant, since only the proof of an OCC is verified on-chain. Through Merkle proofs we enable the efficient inclusion of intermediary block headers for any submitted batch. The OCC is feasible on average consumer hardware. Our prototype verifies 5 block headers in a single proof using the ZoKrates framework. Compared to current approaches we only use 3.3% of the gas costs, resulting in a highly scalable alternative that is trustless, distributed and has no economic assumptions. For future work, we propose to distribute the computational overhead of computing Ethash inside a ZKP through an off-chain distribution module, since we rely on the concurrent execution of the OCC by at least 36 active participants to catch up with the current state of the relay’s blockchain.
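    The Merkle-proof mechanism for proving that an intermediary block header belongs to a submitted batch can be illustrated with a generic sketch. This is our own minimal stand-in (SHA-256 and duplicating the last node on odd levels are assumptions; the thesis targets Ethash- and ZoKrates-specific details not reproduced here):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                    # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes (plus left/right position) proving leaf membership."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2 == 0))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, node_is_left in proof:
        node = h(node + sibling) if node_is_left else h(sibling + node)
    return node == root
```

    The verifier only needs the root and a logarithmic number of sibling hashes, which is what keeps the on-chain inclusion check cheap relative to recomputing the whole batch.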

    Download full text (pdf)
    fulltext
  • 36.
    Li, Yuntao
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Federated Learning for Time Series Forecasting Using Hybrid Model2019Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Time series data has become ubiquitous thanks to affordable edge devices and sensors. Much of this data is valuable for decision making. In order to use these data for the forecasting task, the conventional centralized approach has shown deficiencies regarding large data communication and data privacy issues. Furthermore, Neural Network models cannot make use of the extra information from the time series, thus they usually fail to provide time series specific results. Both issues expose a challenge to large-scale Time Series Forecasting with Neural Network models. All these limitations lead to our research question: can we realize decentralized time series forecasting with a Federated Learning mechanism that is comparable to the conventional centralized setup in forecasting performance? In this work, we propose a Federated Series Forecasting framework that resolves the challenge by allowing users to keep the data locally and learning a shared model by aggregating locally computed updates. Besides, we design a hybrid model to enable Neural Network models to utilize the extra information from the time series and achieve time series specific learning. In particular, the proposed hybrid model outperforms state-of-the-art data-central baseline models on NN5 and Ericsson KPI data. Meanwhile, the federated setting of the proposed model yields results comparable to the data-central setting on both NN5 and Ericsson KPI data. These results together answer the research question of this thesis.
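    The aggregation of locally computed updates into a shared model can be sketched as a FedAvg-style weighted average. This is an illustrative sketch under our own naming, not the thesis’s exact implementation:

```python
def federated_average(client_weights, client_sizes):
    """FedAvg-style aggregation of model parameters.

    client_weights: one flat parameter list per client
    client_sizes: number of local training samples per client,
        used to weight each client's contribution
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(n_params)]
```

    Only these parameter vectors travel over the network; the raw time series stays on each client, which is what addresses the communication and privacy issues mentioned above.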

    Download full text (pdf)
    fulltext
  • 37.
    Lin, Lyu
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Transformer-based Model for Molecular Property Prediction with Self-Supervised Transfer Learning2020Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Molecular property prediction has a vast range of applications in the chemical industry. A powerful molecular property prediction model can promote experiments and production processes. The idea behind this degree project lies in the use of transfer learning to predict molecular properties. The project is divided into two parts. The first part is to build and pre-train the model. The model, which is constructed with a pure attention-based Transformer layer, is pre-trained through a Masked Edge Recovery task with large-scale unlabeled data. Then, the performance of this pre-trained model is tested on different molecular property prediction tasks, finally verifying the effectiveness of transfer learning. The results show that after self-supervised pre-training, the model shows excellent generalization capability. It can be fine-tuned within a short period and performs well in downstream tasks. The effectiveness of transfer learning is reflected in the experiments as well. The pre-trained model not only shortens the task-specific training time but also obtains better performance and avoids overfitting due to too little training data for molecular property prediction.

    Download full text (pdf)
    fulltext
  • 38.
    Lindener, Tobias
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Enabling Arbitrary Memory Constraint Standing Queries on Distributed Stream Processors using Approximate Algorithms2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Relational algebra and SQL have been a standard in declarative analytics for decades. Yet, at web scale, even simple analytics queries can prove challenging within Distributed Stream Processing environments. Two examples of such queries are "count" and "count distinct". Since the aforementioned queries require persistence of all keys (the value identifying an element), such queries would result in continuously increasing memory demand. Through approximation techniques with fixed-size memory layouts, said tasks are feasible and potentially more resource efficient within streaming systems. Within this thesis, (1) the advantages of approximate queries on distributed stream processors are demonstrated. Furthermore, (2) the resource efficiency as well as (3) the challenges of approximation techniques, and (4) dataset-dependent optimizations are presented. The prototype is implemented using the Yahoo DataSketches library on Apache Flink. Based on the evaluation results and the experiences with the prototype, potential improvements such as deeper integration into the streaming framework are presented. Throughout the analysis, the combination of approximate algorithms and distributed stream processing shows promising results depending on the dataset and the required accuracy.
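    The idea of a fixed-memory "count distinct" can be illustrated with a K-minimum-values sketch. This is our own minimal stand-in for the sketch structures used in the thesis, not their implementation: the sketch keeps only the k smallest hash values regardless of how many elements stream past.

```python
import hashlib

class KMVSketch:
    """K-minimum-values sketch: approximate count-distinct in fixed memory."""

    def __init__(self, k=64):
        self.k = k
        self.mins = []                # k smallest hash values seen, kept sorted

    def add(self, item):
        # Hash to a uniform value in [0, 1); duplicates hash identically
        x = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big") / 2**64
        if x in self.mins:
            return
        if len(self.mins) < self.k:
            self.mins.append(x)
            self.mins.sort()
        elif x < self.mins[-1]:
            self.mins[-1] = x
            self.mins.sort()

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)     # exact while under capacity
        # With n distinct uniform values, the k-th minimum is about k / n
        return (self.k - 1) / self.mins[-1]
```

    Memory is bounded by k no matter how long the stream runs, which is exactly the property that makes such standing queries feasible on a stream processor.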

    Download full text (pdf)
    fulltext
  • 39.
    Ljubenkov, Davor
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Optimizing Bike Sharing System Flows using Graph Mining, Convolutional and Recurrent Neural Networks2019Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    A Bicycle-sharing system (BSS) is a popular service scheme deployed in cities of different sizes around the world. Although docked bike systems are its most popular model, it still experiences a number of weaknesses that could be optimized by investigating bike sharing network properties and the evolution of obtained patterns. Keeping a bicycle-sharing system as balanced as possible is the main problem, and thus predicting or minimizing the manual transportation of bikes across the city is the prime objective in order to save logistic costs for operating companies. The purpose of this thesis is two-fold. First, it is to visualize bike flow using data exploration methods and statistical analysis to better understand mobility characteristics with respect to distance, duration, time of the day, spatial distribution, weather circumstances, and other attributes. Second, by obtaining flow visualizations, it is possible to focus on specific directed sub-graphs containing only those pairs of stations whose mutual flow difference is the most asymmetric. By doing so, we are able to use graph mining and machine learning techniques on these unbalanced stations. The identification of spatial structures and their structural change can be captured using a Convolutional neural network (CNN) that takes adjacency matrix snapshots of unbalanced sub-graphs. A structure generated by the previous method is then used in a Long short-term memory recurrent neural network (RNN LSTM) in order to find and predict its dynamic patterns. As a result, we predict bike flows for each node in the possible future sub-graph configuration, which in turn informs bicycle-sharing system owners in advance so they can plan accordingly. This combination of methods notifies them which prospective areas they should focus on more and how many bike relocation phases are to be expected.
    Methods are evaluated using Cross validation (CV), Root mean square error (RMSE) and Mean absolute error (MAE) metrics. Benefits are identified both for urban city planning and for bike sharing companies by saving time and minimizing their costs.
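    The two error metrics mentioned have standard definitions, sketched here for reference (generic code, not taken from the thesis):

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large flow-prediction errors more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

    RMSE ≥ MAE always holds, and a large gap between the two indicates a few stations with very large prediction errors rather than uniformly noisy predictions.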

    Download full text (pdf)
    fulltext
  • 40.
    Magnusson, John
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Finding time-based listening habits in users’ music listening history to lower entropy in data2021Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    In a world where information, entertainment and e-commerce are growing rapidly in terms of volume and options, it can be challenging for individuals to find what they want. Search engines and recommendation systems have emerged as solutions, guiding the users. A typical example of this is Spotify, a music streaming company that utilises users’ listening data and other derived metrics to provide personalised music recommendations. Spotify has a hypothesis that external factors affect users’ listening preferences and that some of these external factors routinely affect the users, such as workout routines and commuting to work. This work aims to find time-based listening habits in users’ music listening history to decrease the entropy in the data, resulting in a better understanding of the users. While this work primarily targets listening habits, the method can, in theory, be applied to any time series-based dataset. Listening histories were split into hour vectors, vectors where each element represents the distribution of a label/genre played during an hour. The hour vectors allowed for a good representation of the data independent of the volume. In addition, they allowed for clustering, making it possible to find hours during which similar music was played. Hour slots that routinely appeared in the same cluster became a profile, highlighting a habit. In the final implementation, a user is represented by a profile vector allowing a different profile for each hour of a week. Several users were profiled with the proposed approach and evaluated in terms of the decrease in Shannon entropy when profiled compared to when not profiled. On average, user entropy dropped by 9%, with highs around 50%, while a small portion of users did not experience any decrease. In addition, the profiling was evaluated by measuring cosine similarity across users’ listening history, showing a correlation between the gain in cosine similarity and the decrease in entropy.
    In conclusion, users become more predictable and interpretable when profiled. This knowledge can be used to understand users better or as a feature for recommender systems and other analyses.
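    The entropy decrease from profiling can be illustrated with a small sketch (our own hypothetical names; per-hour label lists stand in for the thesis’s hour vectors):

```python
import math
from collections import Counter, defaultdict

def shannon_entropy(labels):
    """Shannon entropy (bits) of the label distribution in a play history."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def profiled_entropy(labels_by_hour, profile_of_hour):
    # Average per-profile entropy, weighted by plays: lower than the global
    # entropy whenever the profiles separate distinct listening habits
    groups = defaultdict(list)
    for hour, labels in labels_by_hour.items():
        groups[profile_of_hour[hour]].extend(labels)
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * shannon_entropy(g) for g in groups.values())
```

    A user who plays metal every morning and jazz every evening has high global entropy, but once the hours are grouped into a "morning" and an "evening" profile, each profile is nearly deterministic and the weighted entropy drops.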

    Download full text (pdf)
    fulltext
  • 41.
    Magnússon, Fannar
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Improving Artist Content Matching with Stacking: A comparison of meta-level learners for stacked generalization2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Using automatic methods to assign incoming tracks and albums from multiple sources to artist entities in a digital rights management company, where no universal artist identifier is available and artist names can be ambiguous, is a challenging problem. In this work we propose to use stacked generalization to combine the predictions of heterogeneous classifiers for an improved quality of artist content matching on two datasets from a digital rights management company. We compare the performance of a non-linear meta-level learner to a linear meta-level learner for the stacked generalization on the two datasets, as well as on eight additional datasets to see how well our results generalize. We conduct experiments and evaluate how the different meta-level learners perform, using the base learners’ class probabilities, or a combination of the base learners’ class probabilities and the original input features, as meta-features.

    Our results indicate that stacking with a non-linear meta-level learner can improve predictions on the artist chooser problem. Furthermore, our results indicate that when using a linear meta-level learner for stacked generalization, using the base learners’ class probabilities as meta-features works best, while a combination of the base learners’ class probabilities and the original input features works best with a non-linear meta-level learner. Among all the evaluated stacking approaches, stacking with a non-linear meta-level learner, using a combination of the base learners’ class probabilities and the original input features as meta-features, performs best in our experiments over the ten evaluation datasets.
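    The two meta-feature variants compared above can be sketched as a single constructor (an illustrative sketch with our own names, not code from the thesis): the meta-level learner is then trained on the vectors this function produces.

```python
def meta_features(base_probas, original_features, include_input=True):
    """Build one meta-level feature vector for stacked generalization.

    base_probas: one class-probability vector per base learner,
        produced out-of-fold so the meta-level learner is not trained
        on probabilities the base learners emitted for their own
        training data
    include_input: also append the original input features -- the
        variant that worked best with the non-linear meta-level learner
    """
    feats = [p for probs in base_probas for p in probs]
    if include_input:
        feats += list(original_features)
    return feats
```

    With `include_input=False` the meta-level learner sees only the base learners’ probabilities, the variant that worked best with the linear meta-level learner.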

    Download full text (pdf)
    fulltext
  • 42.
    Marzo i Grimalt, Núria
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Natural Language Processing Model for Log Analysis to Retrieve Solutions For Troubleshooting Processes2021Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    In the telecommunications industry, one of the most time-consuming tasks is troubleshooting and the resolution of Trouble Report (TR) tickets. This task involves the understanding of textual data, which can be challenging due to its domain- and company-specific features. The text contains many abbreviations, typos and tables, as well as numerical information. This work tries to solve the issue of retrieving solutions for new troubleshooting reports in an automated way by using a Natural Language Processing (NLP) model, in particular Bidirectional Encoder Representations from Transformers (BERT)-based approaches. It proposes a text ranking model that, given a description of a fault, can rank the best possible solutions to that problem using answers from past TRs. The model tackles the trade-off between accuracy and latency by implementing a multi-stage BERT-based architecture with an initial retrieval stage and a re-ranker stage. Having a model that achieves the desired accuracy under a latency constraint makes it suitable for industry applications. The experiments to evaluate the latency and the accuracy of the model have been performed on Ericsson’s troubleshooting dataset. The evaluation of the proposed model suggests that it is able to retrieve and re-rank solutions for TRs with a significant improvement compared to a non-BERT model.

    Download full text (pdf)
    fulltext
  • 43.
    Montesi, Daniele
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Repeating Link Prediction over Dynamic Graphs through Edge Embeddings: New method optimizing node pair proximity in the embedding space for link prediction2020Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Transactional graphs are networks made of money transactions and have gained particular attention from researchers thanks to their numerous applications in financial institutions. When considering the Link Prediction task over dynamic graphs, it is possible to use the evolutionary behavior of the network to predict which links will occur next in the graph. Dynamic Link Prediction has been explored widely in the past years; however, the majority of these works focus on discovering new edges, while very few focus on repeating edges, i.e. links that continuously vanish and reappear in the network. In this work, we first study the literature for link prediction in the static setting; then we focus on dynamic link prediction, underlining the strengths and weaknesses of every approach studied. We discover that traditional methods do not work well with repeating links, as they are unable to encode the temporal patterns associated with the edges while also considering the topological graph features. We propose a novel method, Temporal Edge Embedding Neural Network (TEEN), which is based on a deep learning architecture able to jointly optimize the prediction of the correct edge labels and the proximity of node pairs in their latent space at every time step. Our solution benefits from node embeddings created with deep encoders, from which an edge embedding is created for every time step. TEEN is able to outperform state-of-the-art models by over 8% on AUC and F1-score. We conclude that our approach brings significant improvements in the scenario of transactional graphs. More future work should be done in different scenarios to validate the algorithm’s effectiveness in the general case.

  • 44.
    Mrowczynski, Piotr
    KTH, School of Electrical Engineering and Computer Science (EECS).