Knowing an Object by the Company It Keeps: A Domain-Agnostic Scheme for Similarity Discovery
RISE.
2015 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Appropriately defining and then efficiently calculating similarities from large data sets are often essential in data mining, both for building tractable representations and for gaining understanding of data and generating processes. Here we rely on the premise that given a set of objects and their correlations, each object is characterized by its context, i.e. its correlations to the other objects, and that the similarity between two objects therefore can be expressed in terms of the similarity between their respective contexts. Resting on this principle, we propose a data-driven and highly scalable approach for discovering similarities from large data sets by representing objects and their relations as a correlation graph that is transformed to a similarity graph. Together these graphs can express rich structural properties among objects. Specifically, we show that concepts -- representations of abstract ideas and notions -- are constituted by groups of similar objects that can be identified by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of domains, and will here be demonstrated for three distinct types of objects: codons, artists and words, where the numbers of objects and correlations range from small to very large.
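The correlation-graph-to-similarity-graph transformation described above lends itself to a compact illustration. Below is a minimal sketch under assumptions of my own: toy random data, cosine similarity as the context comparison, and a fixed threshold. The paper's actual correlation measures, similarity function, and clustering algorithm are not reproduced here.

```python
# A minimal sketch of the context-similarity idea from the abstract:
# each object is represented by its vector of correlations to all other
# objects (its "context"), and the similarity of two objects is taken
# to be the cosine similarity of their context vectors. Toy data and
# the threshold below are illustrative assumptions, not the paper's
# exact method.
import numpy as np

rng = np.random.default_rng(0)

# Toy correlation graph: corr[i, j] is the observed correlation between
# objects i and j (e.g. co-occurrence counts between words).
n = 8
corr = rng.random((n, n))
corr = (corr + corr.T) / 2          # make it symmetric
np.fill_diagonal(corr, 0.0)         # an object's context excludes itself

# Transform the correlation graph into a similarity graph: similarity
# of i and j = cosine similarity of their context vectors (rows of corr).
contexts = corr / np.linalg.norm(corr, axis=1, keepdims=True)
sim = contexts @ contexts.T

# Keep only strong similarities; clusters in this sparse similarity
# graph are then read off as "concepts" (groups of similar objects).
threshold = 0.8
sim_graph = np.where(sim >= threshold, sim, 0.0)
np.fill_diagonal(sim_graph, 0.0)
print(sim_graph.round(2))
```

Clustering the thresholded similarity graph, for instance with any off-the-shelf community detection method, would then yield the groups of similar objects that the abstract calls concepts.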

Place, publisher, year, edition, pages
2015.
HSV category
Identifiers
URN: urn:nbn:se:kth:diva-250036
DOI: 10.1109/ICDM.2015.85
ISI: 000380541000013
Scopus ID: 2-s2.0-84963516560
ISBN: 978-1-4673-9504-5 (print)
OAI: oai:DiVA.org:kth-250036
DiVA, id: diva2:1307116
Conference
2015 IEEE International Conference on Data Mining (ICDM)
Research funder
Swedish Foundation for Strategic Research, RIT10-0043
Note

QC 20190426

Available from: 2019-04-25 Created: 2019-04-25 Last updated: 2022-06-26 Bibliographically approved
Part of thesis
1. Scalable Machine Learning through Approximation and Distributed Computing
2019 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Machine learning algorithms are now being deployed in practically all areas of our lives. Part of this success can be attributed to the ability to learn complex representations from massive datasets. However, computational speed increases have not kept up with the increase in the sizes of data we want to learn from, leading naturally to algorithms that need to be resource-efficient and parallel. As the proliferation of machine learning continues, the ability for algorithms to adapt to a changing environment and deal with uncertainty becomes increasingly important.

In this thesis we develop scalable machine learning algorithms, with a focus on efficient, online, and distributed computation. We make use of approximations to dramatically reduce the computational cost of exact algorithms, and develop online learning algorithms to deal with a constantly changing environment under a tight computational budget. We design parallel and distributed algorithms to ensure that our methods can scale to massive datasets.

We first propose a scalable algorithm for graph vertex similarity calculation and concept discovery. We demonstrate its applicability to multiple domains, including text, music, and images, and demonstrate its scalability by training on one of the largest text corpora available. Then, motivated by a real-world use case of predicting the session length in media streaming, we propose improvements to several aspects of learning with decision trees. We propose two algorithms to estimate the uncertainty in the predictions of online random forests. We show that our approach can achieve better accuracy than the state of the art while being an order of magnitude faster. We then propose a parallel and distributed online tree boosting algorithm that maintains the correctness guarantees of serial algorithms while providing an order of magnitude speedup on average. Finally, we propose an algorithm that allows for gradient boosted trees training to be distributed across both the data point and feature dimensions. We show that we can achieve communication savings of several orders of magnitude for sparse datasets, compared to existing approaches that can only distribute the computation across the data point dimension and use dense communication.
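As one illustration of the ensemble-uncertainty theme mentioned above, here is a minimal sketch, assuming a batch scikit-learn random forest rather than the online forests the thesis targets: the spread of the individual trees' predictions is used as a simple empirical interval. This demonstrates the general idea of ensemble-based uncertainty only, not the thesis' specific algorithms.

```python
# A minimal sketch, under assumptions, of estimating prediction
# uncertainty from a tree ensemble: the spread of the individual trees'
# predictions yields a simple empirical interval around the point
# prediction. Toy data; the forest is an ordinary batch model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Collect each tree's prediction for a new point.
x_new = rng.normal(size=(1, 3))
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])

# Point prediction plus a 90% empirical interval over the trees.
lo, hi = np.percentile(per_tree, [5, 95])
print(f"prediction: {per_tree.mean():.3f}, 90% interval: [{lo:.3f}, {hi:.3f}]")
```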

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2019. p. 120
Series
TRITA-EECS-AVL ; 2019:43
Keywords
Online Learning, Distributed Computing, Graph Similarity, Decision Trees, Gradient Boosting
HSV category
Identifiers
URN: urn:nbn:se:kth:diva-250038
ISBN: 978-91-7873-181-7
Public defence
2019-05-28, Sal B, Kistagången 16, floor 2, Electrum 1, KTH Kista, Kista, 14:00 (English)
Opponent
Supervisor
Research funder
Swedish Foundation for Strategic Research, RIT10-0043
Swedish Foundation for Strategic Research, BD15-0006
Note

QC 20190426

Available from: 2019-04-26 Created: 2019-04-25 Last updated: 2022-12-12 Bibliographically approved

Open Access in DiVA

Full text not available in DiVA

Other links

Publisher's full text: https://doi.org/10.1109/ICDM.2015.85
Scopus

Person

Vasiloudis, Theodore
