Optimization beyond a single submodular function: Submodular optimization for ranking, decision trees and diversity
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS.
ORCID iD: 0000-0002-1252-7489
2022 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Submodular functions mathematically characterize the ubiquitous "diminishing returns" property. They are widely used to describe core concepts in numerous applications, including economic utility, redundancy in information, spread of influence in social networks, and more. Optimization of submodular functions thus plays a central role in a broad range of applications in data science. However, existing work concentrates largely on optimizing a single submodular function, with a sole focus on succinct subset selection among massive data. Other settings have not been fully explored, for example, when there is interplay between a submodular function and other components.
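As a concrete illustration (not taken from the thesis), a coverage function is a standard example of a submodular function. The sketch below, over a small hypothetical ground set, shows the marginal gain of an item shrinking as the selected set grows:

```python
# Illustrative sketch: a coverage function is submodular, i.e. the
# marginal gain of adding an item diminishes as the chosen set grows.

def coverage(sets_chosen):
    """f(S) = number of distinct elements covered by the chosen sets."""
    return len(set().union(*sets_chosen)) if sets_chosen else 0

# Hypothetical ground set: three overlapping item sets.
ITEMS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}

def marginal_gain(chosen, item):
    """f(S + item) - f(S): the marginal value of adding `item` to S."""
    before = coverage([ITEMS[i] for i in chosen])
    after = coverage([ITEMS[i] for i in chosen + [item]])
    return after - before

# Diminishing returns: "b" is worth less once "a" is already chosen.
print(marginal_gain([], "b"))      # gain 2 on the empty set
print(marginal_gain(["a"], "b"))   # gain 1 on a superset
```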

In this thesis, we study optimization beyond a single (non-decreasing) submodular function in the following three areas.

  • Ranking: Ranking arises naturally in the presence of multiple submodular functions, for instance, when a list of shared resources is served sequentially to satisfy multiple submodular demands with individual budgets. One concrete example is ranking web pages to satisfy multiple user intents with different "patience". We first study this ranking problem under cardinality constraints, and then consider extensions of the basic setting, including knapsack constraints, streaming settings, and robustness requirements.
  • Decision trees: A tree (or a policy) can be seen as an adaptive generalization of ranking. Popular decision trees for classification, including CART and C4.5, are constructed by recursive greedy splits guided by an impurity function. These trees are accurate, but may be hard to understand due to a potentially large tree size. A novel characterization of a decision tree is via adaptive intersection among multiple submodular functions, one for each object to be classified. We show that this characterization enables one to analyze and control the tree complexity for a large family of impurity functions. As a consequence, we are able to produce accurate and interpretable decision trees with bounded complexity.
  • Diversity: Multi-objective optimization is frequently encountered in practice. One important form is the joint optimization of utility and diversity, where utility is often modeled by a submodular function. On the theoretical side, we discuss new techniques for such joint optimization over clustered input; on the practical side, we propose a novel application in learning diverse rule sets, where diversity encourages non-overlapping rules that deliver desirable unambiguous decisions.

For each of the three areas discussed above, we introduce novel formulations for applications in data science, and devise algorithms that are efficient, easy to implement, and have provable approximation guarantees. In addition to theory, we also highlight their practical potential by presenting empirical performance in selected applications. Our work significantly enhances the modeling capabilities of submodular optimization for novel application domains.

Abstract [sv]

Submodular functions mathematically characterize the ubiquitous "diminishing returns" property. They are widely used to describe core concepts in many applications, including economic utility, redundancy in information, spread of influence in social networks, and more. Optimization of submodular functions thus plays a central role in a broad range of applications in data science. However, existing work largely concentrates on optimizing a single submodular function, with a sole focus on succinct subset selection among massive data. Other settings have not been fully explored, for example, when there is interplay between a submodular function and other components.

In this thesis, we study optimization beyond a single (non-decreasing) submodular function in the following three areas.

  • Ranking: Ranking arises naturally in the presence of multiple submodular functions, for example, when a list of shared resources is served sequentially to satisfy multiple submodular demands with individual budgets. One concrete example is ranking web pages to satisfy multiple user intents with different "patience". We first study this ranking problem under cardinality constraints, and then consider extensions of the basic setting, including knapsack constraints, streaming settings, and robustness requirements.
  • Decision trees: A tree (or a policy) can be seen as an adaptive generalization of ranking. Popular decision trees for classification, including CART and C4.5, are constructed by recursive greedy splits guided by an impurity function. These trees are accurate, but may be hard to understand due to a potentially large tree size. A novel characterization of a decision tree is via adaptive intersection among multiple submodular functions, one for each object to be classified. We discover that this characterization makes it possible to analyze and control the tree complexity for a large family of impurity functions. As a consequence, we can produce accurate and interpretable decision trees with bounded complexity.
  • Diversity: Multi-objective optimization is frequently encountered in practice. One important form is the joint optimization of utility and diversity, where utility is often modeled by a submodular function. On the theoretical side, we discuss new techniques for such joint optimization over clustered input; on the practical side, we propose a novel application in learning diverse rule sets, where diversity encourages non-overlapping rules that deliver desirable unambiguous decisions.

For each of the three areas discussed above, we introduce novel formulations for applications in data science and develop algorithms that are efficient, easy to implement, and have provable approximation guarantees. Beyond theory, we also highlight their practical potential by presenting empirical performance in selected applications. Our work significantly enhances the modeling capabilities of submodular optimization for new application domains.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2022, p. 232
Series
TRITA-EECS-AVL ; 2023:8
Keywords [en]
Submodular optimization, Approximation algorithms, Ranking, Decision trees, Diversity
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-322458
ISBN: 978-91-8040-459-4 (print)
OAI: oai:DiVA.org:kth-322458
DiVA, id: diva2:1719648
Public defence
2023-02-15, Zoom: https://kth-se.zoom.us/j/69024442639, F3, Lindstedtsvägen 26, Stockholm, 13:30 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 70101
Note

QC 20221220

Available from: 2022-12-16 Created: 2022-12-15 Last updated: 2025-01-27 Bibliographically approved
List of papers
1. Ranking with submodular functions on a budget
2022 (English) In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 36, no 3, p. 1197-1218. Article in journal (Refereed) Published
Abstract [en]

Submodular maximization has been the backbone of many important machine-learning problems, and has applications to viral marketing, diversification, sensor placement, and more. However, the study of maximizing submodular functions has mainly been restricted to the context of selecting a set of items. On the other hand, many real-world applications require a solution that is a ranking over a set of items. The problem of ranking in the context of submodular function maximization has been considered before, but to a much lesser extent than item-selection formulations. In this paper, we explore a novel formulation for ranking items with submodular valuations and budget constraints. We refer to this problem as max-submodular ranking (MSR). In more detail, given a set of items and a set of non-decreasing submodular functions, where each function is associated with a budget, we aim to find a ranking of the set of items that maximizes the sum of values achieved by all functions under the budget constraints. For the MSR problem with cardinality- and knapsack-type budget constraints we propose practical algorithms with approximation guarantees. In addition, we perform an empirical evaluation, which demonstrates the superior performance of the proposed algorithms against strong baselines.
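A minimal greedy sketch of the MSR objective described above, with hypothetical toy demands; this is an illustration of the setting (each function only "sees" the prefix of the ranking within its budget), not the paper's algorithm:

```python
# Greedy sketch of max-submodular ranking (MSR): at each position, append
# the item with the largest total marginal gain, counted only for the
# functions whose budget still covers that position.

def greedy_msr(items, functions):
    """functions: list of (f, budget) pairs; f maps a set of items to a value."""
    ranking = []
    remaining = list(items)
    while remaining:
        pos = len(ranking) + 1
        def total_gain(x):
            before = set(ranking)
            return sum(f(before | {x}) - f(before)
                       for f, b in functions if b >= pos)
        best = max(remaining, key=total_gain)   # ties: earliest item wins
        ranking.append(best)
        remaining.remove(best)
    return ranking

# Toy demands: two "users" value different items, with patience 1 and 2.
f1 = lambda S: len(S & {"x"})         # user 1 only wants x, budget 1
f2 = lambda S: len(S & {"y", "z"})    # user 2 wants y and z, budget 2
print(greedy_msr(["x", "y", "z"], [(f1, 1), (f2, 2)]))   # ['x', 'y', 'z']
```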

Place, publisher, year, edition, pages
Springer Science and Business Media LLC, 2022
Keywords
Ranking, Submodular maximization, Dynamic programming, Approximation algorithms
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-322390 (URN)
10.1007/s10618-022-00833-4 (DOI)
000785926000001 ()
35601821 (PubMedID)
2-s2.0-85128736051 (Scopus ID)
Note

QC 20221214

Available from: 2022-12-13 Created: 2022-12-13 Last updated: 2025-01-27 Bibliographically approved
2. Regularized impurity reduction: accurate decision trees with complexity guarantees
2022 (English) In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, p. 1-42. Article in journal (Refereed) Published
Abstract [en]

Decision trees are popular classification models, providing high accuracy and intuitive explanations. However, as the tree size grows the model interpretability deteriorates. Traditional tree-induction algorithms, such as C4.5 and CART, rely on impurity-reduction functions that promote the discriminative power of each split. Thus, although these traditional methods are accurate in practice, there has been no theoretical guarantee that they will produce small trees. In this paper, we justify the use of a general family of impurity functions, including the popular functions of entropy and Gini-index, in scenarios where small trees are desirable, by showing that a simple enhancement can equip them with complexity guarantees. We consider a general setting, where objects to be classified are drawn from an arbitrary probability distribution, classification can be binary or multi-class, and splitting tests are associated with non-uniform costs. As a measure of tree complexity, we adopt the expected cost to classify an object drawn from the input distribution, which, in the uniform-cost case, is the expected number of tests. We propose a tree-induction algorithm that gives a logarithmic approximation guarantee on the tree complexity. This approximation factor is tight up to a constant factor under mild assumptions. The algorithm recursively selects a test that maximizes a greedy criterion defined as a weighted sum of three components. The first two components encourage the selection of tests that improve the balance and the cost-efficiency of the tree, respectively, while the third impurity-reduction component encourages the selection of more discriminative tests. As shown in our empirical evaluation, compared to the original heuristics, the enhanced algorithms strike an excellent balance between predictive accuracy and tree complexity.
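The abstract above describes a greedy criterion that is a weighted sum of three components. The sketch below shows the *shape* of such an enhanced criterion; the weights and the exact balance and cost terms here are placeholders, not the paper's formulas:

```python
import math

def entropy(labels):
    """Shannon entropy of a label multiset -- a standard impurity function."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_score(labels, left, right, cost=1.0, alpha=1.0, beta=1.0):
    """Higher is better. Combines (hypothetical weights alpha, beta):
      * the usual impurity reduction of the split,
      * a balance bonus favoring even splits (helps bound tree depth),
      * a penalty for expensive tests."""
    n = len(labels)
    reduction = entropy(labels) - (len(left) / n) * entropy(left) \
                                - (len(right) / n) * entropy(right)
    balance = min(len(left), len(right)) / n   # 0.5 for a perfect split
    return reduction + alpha * balance - beta * (cost - 1.0)

# An even, pure split scores higher than a lopsided one.
even = split_score([0, 0, 1, 1], [0, 0], [1, 1])
skewed = split_score([0, 0, 1, 1], [0], [0, 1, 1])
print(even > skewed)   # True
```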

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Decision trees, Impurity functions, Submodularity, Tree complexity, Approximation algorithms
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-322391 (URN)
10.1007/s10618-022-00884-7 (DOI)
000889424300001 ()
36618773 (PubMedID)
2-s2.0-85142879601 (Scopus ID)
Note

QC 20221214

Available from: 2022-12-13 Created: 2022-12-13 Last updated: 2026-01-30 Bibliographically approved
3. Coresets remembered and items forgotten: submodular maximization with deletions
(English) Manuscript (preprint) (Other academic)
Abstract [en]

In recent years we have witnessed an increase in the development of methods for submodular optimization, which have been motivated by the wide applicability of submodular functions in real-world data-science problems. In this paper, we contribute to this line of work by considering the problem of robust submodular maximization against unexpected deletions, which may occur due to privacy issues or user preferences. Specifically, we consider the minimum number of items an algorithm has to remember, in order to achieve a non-trivial approximation guarantee against adversarial deletion of up to d items. We refer to the set of items that an algorithm has to keep before adversarial deletions as a deletion-robust coreset. Our theoretical contributions are twofold. First, we propose a single-pass streaming algorithm that yields a (1−2ϵ)/(4p)-approximation for maximizing a non-decreasing submodular function under a general p-matroid constraint and requires a coreset of size k+d/ϵ, where k is the maximum size of a feasible solution. To the best of our knowledge, this is the first work to achieve an (asymptotically) optimal coreset, as no constant-factor approximation is possible with a coreset of size sublinear in d. Second, we devise an effective offline algorithm that guarantees stronger approximation ratios with a coreset of size O(dlog(k)/ϵ). We also demonstrate the superior empirical performance of the proposed algorithms in real-life applications.
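A toy sketch of the deletion-robust idea above (not the paper's algorithm): keep more than k items as a coreset so that, after an adversary deletes up to d items, a good solution of size k still survives. For simplicity the objective here is additive, so "greedy" reduces to picking top values:

```python
def build_coreset(values, k, d):
    """Keep the k + d highest-value items as the deletion-robust coreset."""
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[: k + d])

def solve_after_deletions(values, coreset, deleted, k):
    """Re-optimize over the surviving coreset items only."""
    alive = [x for x in coreset if x not in deleted]
    return set(sorted(alive, key=values.get, reverse=True)[:k])

vals = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
core = build_coreset(vals, k=2, d=1)                       # {"a", "b", "c"}
sol = solve_after_deletions(vals, core, deleted={"a"}, k=2)
print(sorted(sol))   # ['b', 'c']
```

The real setting is harder because submodular marginal gains depend on what else is kept; that is where the paper's coreset-size bounds come in.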

Keywords
robust optimization, submodular maximization, streaming algorithms, approximation algorithms
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-322389 (URN)
Note

QC 20221214

Available from: 2022-12-13 Created: 2022-12-13 Last updated: 2025-01-27 Bibliographically approved
4. Diverse Rule Sets
2020 (English) In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery (ACM), 2020, p. 1532-1541. Conference paper, Published paper (Refereed)
Abstract [en]

While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models poses difficulties for these models to be fully trusted and embraced. Thus, interpretability of models has been recognized as an equally important quality as their predictive power. In particular, rule-based systems are experiencing a renaissance owing to their intuitive if-then representation. However, simply being rule-based does not ensure interpretability. For example, overlapping rules spawn ambiguity and hinder interpretation. Here we propose a novel approach for inferring diverse rule sets, by optimizing small overlap among decision rules with a 2-approximation guarantee under the framework of Max-Sum diversification. We formulate the problem as maximizing a weighted sum of discriminative quality and diversity of a rule set. In order to overcome an exponential-size search space of association rules, we investigate several natural options for a small candidate set of high-quality rules, including frequent and accurate rules, and examine their hardness. Leveraging the special structure in our formulation, we then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap. The proposed sampling algorithm analytically targets a distribution of rules that is tailored to our objective. We demonstrate the superior predictive power and interpretability of our model with a comprehensive empirical study against strong baselines.
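The objective above (quality plus small overlap) can be sketched with plain greedy selection over hypothetical rules modeled as sets of covered rows; this illustrates the formulation only, not the paper's sampling algorithm:

```python
# Greedy sketch of a quality-plus-diversity rule-set objective: add the
# rule whose quality minus (weighted) overlap with already-chosen rules
# is largest. Rules are modeled as sets of covered examples.

def overlap(r1, r2):
    return len(r1 & r2)

def greedy_diverse_rules(rules, quality, k, lam=1.0):
    """rules: dict name -> set of covered examples; quality: name -> float."""
    chosen = []
    while len(chosen) < k:
        def gain(r):
            return quality[r] - lam * sum(overlap(rules[r], rules[c])
                                          for c in chosen)
        best = max((r for r in rules if r not in chosen), key=gain)
        chosen.append(best)
    return chosen

rules = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {5, 6}}
quality = {"r1": 3.0, "r2": 2.5, "r3": 2.0}
# r2 has higher quality than r3, but r3 wins: it does not overlap r1.
print(greedy_diverse_rules(rules, quality, k=2))   # ['r1', 'r3']
```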

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Keywords
classifier, diversification, pattern mining, rule learning, rule sets, sampling, Information management, Decision rules, Empirical studies, Interpretability, Machine learning models, Predictive power, Randomized Algorithms, Sampling algorithm, Special structure, Data mining
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-286486 (URN)
10.1145/3394486.3403204 (DOI)
000749552301052 ()
2-s2.0-85090415803 (Scopus ID)
Conference
26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020, 23 August 2020 through 27 August 2020
Note

Part of proceedings ISBN 9781450379984

QC 20201217

Available from: 2020-12-17 Created: 2020-12-17 Last updated: 2025-01-27 Bibliographically approved
5. Maximizing diversity over clustered data*
2020 (English) In: Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020, Society for Industrial and Applied Mathematics Publications, 2020, p. 649-657. Conference paper, Published paper (Refereed)
Abstract [en]

Maximum diversity aims at selecting a diverse set of high-quality objects from a collection, which is a fundamental problem and has a wide range of applications, e.g., in Web search. Diversity under a uniform or partition matroid constraint naturally describes useful cardinality or budget requirements, and admits simple approximation algorithms [5]. When applied to clustered data, however, popular algorithms such as picking objects iteratively and performing local search lose their approximation guarantees towards maximum intra-cluster diversity because they fail to optimize the objective in a global manner. We propose an algorithm that greedily adds a pair of objects instead of a singleton, and which attains a constant approximation factor over clustered data. We further extend the algorithm to the case of monotone and submodular quality functions, and to a partition matroid constraint. We also devise a technique to make our algorithm scalable, and on the way we obtain a modification that gives better solutions in practice while maintaining the approximation guarantee in theory. Our algorithm achieves excellent performance, compared to strong baselines, on a mix of synthetic and real-world datasets.
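The pair-greedy idea described above can be sketched on toy 1-D points: to maximize the sum of pairwise distances, repeatedly add the *pair* of remaining points whose total added dispersion is largest. This is an illustration of the pairing trick, not the paper's full clustered-data algorithm:

```python
from itertools import combinations

def dispersion(points):
    """Sum of pairwise distances -- the diversity objective on 1-D points."""
    return sum(abs(a - b) for a, b in combinations(points, 2))

def pair_greedy(candidates, k):
    """Select k points (k even) by greedily adding the best pair."""
    chosen = []
    remaining = list(candidates)
    while len(chosen) < k:
        best = max(combinations(remaining, 2),
                   key=lambda p: dispersion(chosen + list(p)))
        chosen.extend(best)
        for x in best:
            remaining.remove(x)
    return sorted(chosen)

print(pair_greedy([0, 1, 5, 9, 10], k=2))   # [0, 10]: the farthest pair
```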

Place, publisher, year, edition, pages
Society for Industrial and Applied Mathematics Publications, 2020
Keywords
Approximation algorithms, Budget control, Clustering algorithms, Combinatorial mathematics, Iterative methods, Approximation factor, Cardinalities, Clustered datum, High quality, Intra-cluster, Local search, Partition matroid, Real-world datasets, Data mining
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-285364 (URN)
10.1137/1.9781611976236.73 (DOI)
000627117200073 ()
2-s2.0-85089176616 (Scopus ID)
Conference
2020 SIAM International Conference on Data Mining, SDM 2020; Cincinnati; United States; 7 May 2020 through 9 May 2020
Note

QC 20210427

ISBN: 978-1-61197-623-6

Available from: 2020-12-01 Created: 2020-12-01 Last updated: 2025-01-27 Bibliographically approved
6. Ranking with submodular functions on the fly
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Maximizing submodular functions has been studied extensively for a wide range of subset-selection problems. However, much less attention has been given to the role of submodularity in sequence-selection and ranking problems. A recently introduced framework, named maximum submodular ranking (MSR), tackles a family of ranking problems that arise naturally when resources are shared among multiple demands with different budgets. For example, the MSR framework can be used to rank web pages for multiple user intents. In this paper, we extend the MSR framework to the streaming setting. In particular, we consider two different streaming models and propose practical approximation algorithms. In the first streaming model, called function arriving, we assume that submodular functions (demands) arrive continuously in a stream, while in the second model, called item arriving, we assume that items (resources) arrive continuously. Furthermore, we study the MSR problem with additional constraints on the output sequence, such as a matroid constraint that can ensure fair exposure among items from different groups. These extensions significantly broaden the range of problems that can be captured by the MSR framework. On the practical side, we develop several novel applications based on the MSR formulation, and empirically evaluate the performance of the proposed methods.
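A minimal sketch of the item-arriving streaming model mentioned above: items stream by once, and one is kept only if its marginal gain for the current prefix clears a fixed threshold tau. The threshold rule and toy data are illustrative only; the actual MSR streaming algorithms and guarantees are in the paper:

```python
def stream_rank(item_stream, f, budget, tau):
    """Build a ranking of at most `budget` items in one pass over the
    stream, keeping an item only when f(S + x) - f(S) >= tau."""
    ranking = []
    for x in item_stream:
        if len(ranking) >= budget:
            break
        gain = f(set(ranking) | {x}) - f(set(ranking))
        if gain >= tau:
            ranking.append(x)
    return ranking

# Items are sets of covered elements; f is their coverage.
cover = lambda S: len({e for x in S for e in x})
stream = [frozenset({1, 2}), frozenset({2}), frozenset({3, 4})]
# frozenset({2}) is skipped: element 2 is already covered (gain 0 < tau).
print(stream_rank(stream, cover, budget=2, tau=2))
```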

Keywords
submodular optimization, streaming algorithms, ranking
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-322388 (URN)
Note

QC 20221214

Available from: 2022-12-13 Created: 2022-12-13 Last updated: 2025-01-27 Bibliographically approved

Open Access in DiVA

fulltext (1108 kB), 538 downloads
File information
File name: FULLTEXT01.pdf
File size: 1108 kB
Checksum (SHA-512): 07add6b029bd8ad937603efbe3c6e653626420bc21f07bff11329a1fe318b8915a9cb2f23ece564b40ed4e75cce246a9bc407ba734b4087c155bbd3c6a3a0f4b
Type: fulltext
Mimetype: application/pdf

Authority records

Zhang, Guangyi
