kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Regularized impurity reduction: accurate decision trees with complexity guarantees
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS.ORCID iD: 0000-0002-1252-7489
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS.ORCID iD: 0000-0002-5211-112X
2022 (English)In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, p. 1-42Article in journal (Refereed) Published
Abstract [en]

Decision trees are popular classification models, providing high accuracy and intuitive explanations. However, as the tree size grows the model interpretability deteriorates. Traditional tree-induction algorithms, such as C4.5 and CART, rely on impurity-reduction functions that promote the discriminative power of each split. Thus, although these traditional methods are accurate in practice, there has been no theoretical guarantee that they will produce small trees. In this paper, we justify the use of a general family of impurity functions, including the popular functions of entropy and Gini-index, in scenarios where small trees are desirable, by showing that a simple enhancement can equip them with complexity guarantees. We consider a general setting, where objects to be classified are drawn from an arbitrary probability distribution, classification can be binary or multi-class, and splitting tests are associated with non-uniform costs. As a measure of tree complexity, we adopt the expected cost to classify an object drawn from the input distribution, which, in the uniform-cost case, is the expected number of tests. We propose a tree-induction algorithm that gives a logarithmic approximation guarantee on the tree complexity. This approximation factor is tight up to a constant factor under mild assumptions. The algorithm recursively selects a test that maximizes a greedy criterion defined as a weighted sum of three components. The first two components encourage the selection of tests that improve the balance and the cost-efficiency of the tree, respectively, while the third impurity-reduction component encourages the selection of more discriminative tests. As shown in our empirical evaluation, compared to the original heuristics, the enhanced algorithms strike an excellent balance between predictive accuracy and tree complexity.

Place, publisher, year, edition, pages
Springer Nature , 2022. p. 1-42
Keywords [en]
Decision trees, Impurity functions, Submodularity, Tree complexity, Approximation algorithms
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-322391DOI: 10.1007/s10618-022-00884-7ISI: 000889424300001PubMedID: 36618773Scopus ID: 2-s2.0-85142879601OAI: oai:DiVA.org:kth-322391DiVA, id: diva2:1718867
Note

QC 20221214

Available from: 2022-12-13 Created: 2022-12-13 Last updated: 2025-01-27Bibliographically approved
In thesis
1. Optimization beyond a single submodular function: Submodular optimization for ranking, decision trees and diversity
Open this publication in new window or tab >>Optimization beyond a single submodular function: Submodular optimization for ranking, decision trees and diversity
2022 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Submodular functions characterize mathematically the ubiquitous ``diminishing-returns'’ property. They are widely used to describe core subjects in numerous applications, including economic utility, redundancy in information, spread of influence in social networks, and more. Optimization of submodular functions thus plays a central role in a broad range of applications in data science. However, existing works concentrate largely on optimizing a single submodular function, with a sole focus on succinct subset selection among massive data. Other circumstances have not been fully explored, for example, when there exists interplay between a submodular function and other components.

In this thesis, we study optimization beyond a single (non-decreasing) submodular function in the following three areas.

  • Ranking: Ranking arises naturally in the presence of multiple submodular functions, for instance, when a list of shared resources is sequentially served to satisfy multiple submodular demands with individual budgets. One concrete example is ranking web pages to satisfy multiple user intents with different ``patience.'' We first study this ranking problem under cardinality constraints, and then we consider extensions of the basic setting, including knapsack constraints, streaming settings, and robustness requirements.
  • Decision trees: A tree (or a policy) can be seen as an adaptive generalization of ranking. Popular decision trees for classification, including CART and C4.5, are constructed by recursive greedy splits, guided by some impurity function. These trees are accurate, but may not be easy to understand due to a potentially large tree size. A novel characterization of a decision tree is via adaptive intersection among multiple submodular functions, one for each object to be classified. We discover that this characterization enables one to analyze and control the tree complexity for a large family of impurity functions. As a consequence, we are able to produce accurate and interpretable decision trees with bounded complexity.
  • Diversity: Multi-objective optimization is frequently encountered in reality. One specific important form is a joint optimization of utility and diversity, where utility is often modeled by a submodular function. From the theoretical side, we discuss new techniques for such joint optimization over clustered input; from the practical side, we propose one novel application in learning diverse rule sets, where diversity encourages non-overlapping rules that deliver desirable unambiguous decisions.    

For each of the three areas discussed above, we introduce novel formulations for applications in data science, and devise algorithms that are efficient, easy to implement, and have provable approximation guarantees. In addition to theory, we also highlight their practical potential by presenting empirical performance in selected applications. Our work significantly enhances the modeling capabilities of submodular optimization for novel application domains.

Abstract [sv]

Submodulära funktioner karakteriserar matematiskt den allestädes närvarande egenskapen ``minskande avkastning''. De används ofta för att beskriva kärnämnen i många tillämpningar, inklusive ekonomisk nytta, redundans i information, spridning av inflytande i sociala nätverk och mer. Optimering av submodulära funktioner spelar således en central roll i ett brett spektrum av tillämpningar inom datavetenskap. Befintliga arbeten koncentrerar sig dock till stor del på att optimera en enda submodulär funktion, med enbart fokus på kortfattat urval av delmängder bland massiva data. Andra omständigheter har inte undersökts fullt ut, till exempel när det finns ett samspel mellan en submodulär funktion och andra komponenter.

I denna avhandling studerar vi optimering bortom en enda (icke-minskande) submodulär funktion inom följande tre områden.

  • Ranking: Rangordning uppstår naturligt i närvaro av flera submodulära funktioner, till exempel när en lista med delade resurser serveras sekventiellt för att tillfredsställa flera submodulära krav med individuella budgetar. Ett konkret exempel är att rangordna webbsidor för att tillfredsställa flera användares avsikter med olika ``tålamod''. Vi studerar först detta rankningsproblem under kardinalitetsbegränsningar, och sedan överväger vi utökningar av den grundläggande inställningen, inklusive ryggsäcksbegränsningar, strömningsinställningar och robusthetskrav.
  • Beslutsträd: Ett träd (eller en policy) kan ses som en adaptiv generalisering av rangordning. Populära beslutsträd för klassificering, inklusive CART och C4.5, är konstruerade av rekursiva giriga splittringar, styrda av någon orenhetsfunktion. Dessa träd är korrekta, men kanske inte är lätta att förstå på grund av en potentiellt stor trädstorlek. En ny karaktärisering av ett beslutsträd sker via adaptiv skärningspunkt mellan flera submodulära funktioner, en för varje objekt som ska klassificeras. Vi upptäcker att denna karakterisering gör det möjligt för en att analysera och kontrollera trädets komplexitet för en stor familj av föroreningsfunktioner.Som en konsekvens kan vi producera korrekta och tolkningsbara beslutsträd med begränsad komplexitet.
  • Mångfald: Multi-objektiv optimering påträffas ofta i verkligheten. En specifik viktig form är en gemensam optimering av nytta och mångfald, där nytta ofta modelleras av en submodulär funktion. Från den teoretiska sidan diskuterar vi nya tekniker för sådan gemensam optimering över klustrad input; Från den praktiska sidan föreslår vi en ny tillämpning för att lära sig olika regeluppsättningar, där mångfald uppmuntrar icke-överlappande regler som ger önskvärda entydiga beslut.

För vart och ett av de tre områden som diskuterats ovan, vi introducerar nya formuleringar för tillämpningar inom datavetenskap och utvecklar algoritmer som är effektiva, lätta att implementera och har bevisbara approximationsgarantier. Utöver teori lyfter vi också fram deras praktiska potential genom att presentera empiriska prestationer i utvalda tillämpningar. Vårt arbete förbättrar avsevärt modelleringsmöjligheterna för submodulär optimering för nya applikationsdomäner.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2022. p. 232
Series
TRITA-EECS-AVL ; 2023:8
Keywords
Submodular optimization, Approximation algorithms, Ranking, Decision trees, Diversity
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-322458 (URN)978-91-8040-459-4 (ISBN)
Public defence
2023-02-15, Zoom: https://kth-se.zoom.us/j/69024442639, F3, Lindstedtsvägen 26, Stockholm, 13:30 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 70101
Note

QC 20221220

Available from: 2022-12-16 Created: 2022-12-15 Last updated: 2025-01-27Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textPubMedScopus

Authority records

Zhang, GuangyiGionis, Aristides

Search in DiVA

By author/editor
Zhang, GuangyiGionis, Aristides
By organisation
Theoretical Computer Science, TCS
In the same journal
Data mining and knowledge discovery
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar

doi
pubmed
urn-nbn

Altmetric score

doi
pubmed
urn-nbn
Total: 48 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf