kth.sePublikationer
Ändra sökning
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Domain-agnostic discovery of similarities and concepts at scale
RISE.
RISE.
RISE.
2017 (Engelska)Ingår i: Knowledge and Information Systems, ISSN 0219-1377, E-ISSN 0219-3116, Vol. 51, nr 2, s. 531-560Artikel i tidskrift (Refereegranskat) Published
Abstract [en]

Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we here rely on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts—abstractions of objects—are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and will be demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.

Ort, förlag, år, upplaga, sidor
London: Springer, 2017. Vol. 51, nr 2, s. 531-560
Nyckelord [en]
Similarity discovery, Concept mining, Distributional semantics, Graph processing
Nationell ämneskategori
Datavetenskap (datalogi)
Identifikatorer
URN: urn:nbn:se:kth:diva-249992DOI: 10.1007/s10115-016-0984-2ISI: 000399408200007Scopus ID: 2-s2.0-84984793995OAI: oai:DiVA.org:kth-249992DiVA, id: diva2:1306826
Projekt
E2ECLOUDS
Forskningsfinansiär
Stiftelsen för strategisk forskning (SSF), RIT10-0043
Anmärkning

QC 20190426

Tillgänglig från: 2019-04-25 Skapad: 2019-04-25 Senast uppdaterad: 2022-06-26Bibliografiskt granskad
Ingår i avhandling
1. Scalable Machine Learning through Approximation and Distributed Computing
Öppna denna publikation i ny flik eller fönster >>Scalable Machine Learning through Approximation and Distributed Computing
2019 (Engelska)Doktorsavhandling, sammanläggning (Övrigt vetenskapligt)
Abstract [en]

Machine learning algorithms are now being deployed in practically all areas of our lives. Part of this success can be attributed to the ability to learn complex representations from massive datasets. However, computational speed increases have not kept up with the increase in the sizes of data we want to learn from, leading naturally to algorithms that need to be resource-efficient and parallel. As the proliferation of machine learning continues, the ability for algorithms to adapt to a changing environment and deal with uncertainty becomes increasingly important.

In this thesis we develop scalable machine learning algorithms, with a focus on efficient, online, and distributed computation. We make use of approximations to dramatically reduce the computational cost of exact algorithms, and develop online learning algorithms to deal with a constantly changing environment under a tight computational budget. We design parallel and distributed algorithms to ensure that our methods can scale to massive datasets.

We first propose a scalable algorithm for graph vertex similarity calculation and concept discovery. We demonstrate its applicability to multiple domains, including text, music, and images, and demonstrate its scalability by training on one of the largest text corpora available. Then, motivated by a real-world use case of predicting the session length in media streaming, we propose improvements to several aspects of learning with decision trees. We propose two algorithms to estimate the uncertainty in the predictions of online random forests. We show that our approach can achieve better accuracy than the state of the art while being an order of magnitude faster. We then propose a parallel and distributed online tree boosting algorithm that maintains the correctness guarantees of serial algorithms while providing an order of magnitude speedup on average. Finally, we propose an algorithm that allows for gradient boosted trees training to be distributed across both the data point and feature dimensions. We show that we can achieve communication savings of several orders of magnitude for sparse datasets, compared to existing approaches that can only distribute the computation across the data point dimension and use dense communication.

Ort, förlag, år, upplaga, sidor
KTH Royal Institute of Technology, 2019. s. 120
Serie
TRITA-EECS-AVL ; 2019:43
Nyckelord
Online Learning, Distributed Computing, Graph Similarity, Decision Trees, Gradient Boosting
Nationell ämneskategori
Datavetenskap (datalogi)
Identifikatorer
urn:nbn:se:kth:diva-250038 (URN)978-91-7873-181-7 (ISBN)
Disputation
2019-05-28, Sal B, Kistagången 16, våningsplan 2, Electrum 1, KTH Kista, Kista, 14:00 (Engelska)
Opponent
Handledare
Forskningsfinansiär
Stiftelsen för strategisk forskning (SSF), RIT10-0043Stiftelsen för strategisk forskning (SSF), BD15-0006
Anmärkning

QC 20190426

Tillgänglig från: 2019-04-26 Skapad: 2019-04-25 Senast uppdaterad: 2022-12-12Bibliografiskt granskad

Open Access i DiVA

Fulltext saknas i DiVA

Övriga länkar

Förlagets fulltextScopushttps://doi.org/10.1007/s10115-016-0984-2

Person

Vasiloudis, Theodore

Sök vidare i DiVA

Av författaren/redaktören
Vasiloudis, Theodore
I samma tidskrift
Knowledge and Information Systems
Datavetenskap (datalogi)

Sök vidare utanför DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetricpoäng

doi
urn-nbn
Totalt: 156 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf