Mind the Data, Measuring the Performance Gap Between Tree Ensembles and Deep Learning on Tabular Data
King, Stockholm, Sweden.
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS. KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Centres, Wallenberg Wood Science Center. ORCID iD: 0000-0003-0422-6560
Center for Applied Intelligent Systems Research, Halmstad University, Halmstad, Sweden.
Center for Applied Intelligent Systems Research, Halmstad University, Halmstad, Sweden.
2024 (English) In: Advances in Intelligent Data Analysis XXII - 22nd International Symposium on Intelligent Data Analysis, IDA 2024, Proceedings, Springer Nature, 2024, Vol. 14641, p. 65-76. Conference paper, Published paper (Refereed)
Abstract [en]

Recent machine learning studies on tabular data show that ensembles of decision tree models are more efficient and performant than deep learning models such as Tabular Transformer models. However, as we demonstrate, these studies are limited in scope and do not paint the full picture. In this work, we focus on how two dataset properties, namely dataset size and feature complexity, affect the empirical performance comparison between tree ensembles and Tabular Transformer models. Specifically, we employ a hypothesis-driven approach and identify situations where Tabular Transformer models are expected to outperform tree ensemble models. Through empirical evaluation, we demonstrate that, given large enough datasets, deep learning models perform better than tree models. This effect becomes more pronounced when complex feature interactions exist in the given task and dataset, suggesting that one must pay careful attention to dataset properties when selecting a model for tabular data in machine learning, especially in an industrial setting, where larger and larger datasets with less and less carefully engineered features are becoming routinely available.
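
The paper's actual experimental protocol is not reproduced in this record; purely as a hypothetical sketch of the kind of comparison the abstract describes, the snippet below sweeps dataset size and, as a crude proxy for feature complexity, the number of informative features in a synthetic task, scoring a gradient-boosted tree ensemble against a small multilayer perceptron that stands in for a deep tabular model. All datasets, models, and parameter values here are illustrative assumptions.

```python
# Hypothetical sketch only: compare a gradient-boosted tree ensemble with a
# small neural network while sweeping dataset size and a crude proxy for
# feature complexity. None of these values come from the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def compare(n_samples: int, n_informative: int) -> dict:
    # Synthetic binary classification task; n_informative stands in (very
    # loosely) for how much feature interaction the task contains.
    X, y = make_classification(
        n_samples=n_samples,
        n_features=50,
        n_informative=n_informative,
        random_state=0,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    trees = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    # The MLP is only a stand-in for a deep tabular model such as a Tabular
    # Transformer, which would normally be built in a deep-learning framework.
    mlp = MLPClassifier(
        hidden_layer_sizes=(128, 128), max_iter=100, random_state=0
    ).fit(X_tr, y_tr)

    return {
        "n_samples": n_samples,
        "n_informative": n_informative,
        "trees_auc": roc_auc_score(y_te, trees.predict_proba(X_te)[:, 1]),
        "mlp_auc": roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1]),
    }


if __name__ == "__main__":
    for n_samples in (1_000, 10_000, 50_000):
        for n_informative in (5, 30):  # low vs. high interaction complexity
            print(compare(n_samples, n_informative))
```

A real comparison would substitute an actual Tabular Transformer implementation and the benchmark datasets used in the paper; the sketch only mirrors the hypothesis-driven pattern of sweeping dataset properties rather than reporting one aggregate score.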

Place, publisher, year, edition, pages
Springer Nature, 2024. Vol. 14641, p. 65-76
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 14641
Keywords [en]
Gradient boosting, Tabular data, Tabular Transformers
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-346536
DOI: 10.1007/978-3-031-58547-0_6
ISI: 001295919100006
Scopus ID: 2-s2.0-85192227414
OAI: oai:DiVA.org:kth-346536
DiVA, id: diva2:1858452
Conference
22nd International Symposium on Intelligent Data Analysis, IDA 2024, Stockholm, Sweden, Apr 24 2024 - Apr 26 2024
Note

QC 20240521

Part of ISBN 978-3-031-58546-3

Available from: 2024-05-16 Created: 2024-05-16 Last updated: 2025-02-07. Bibliographically approved
In thesis
1. Representation Learning and Parallelization for Machine Learning Applications with Graph, Tabular, and Time-Series Data
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Machine Learning (ML) models have achieved significant success in representation learning across domains such as vision, language, graphs, and tabular data. Constructing effective ML models hinges on several critical considerations: (1) data representation: how to represent the input data in a meaningful and effective way; (2) learning objectives: how to define the desired prediction target in a specific downstream task; (3) model architecture: which representation learning model architecture, i.e., the type of neural network, is the most appropriate for the given downstream task; and (4) training strategy: how to effectively train the selected ML model for better feature extraction and representation quality.

This thesis explores representation learning and parallelization in machine learning, addressing how to boost model accuracy and reduce training time. Our research investigates several innovative approaches to improving the efficiency and effectiveness of ML applications on graph, tabular, and time-series data, with contributions to areas such as combinatorial optimization, parallel training, and ML methods across these data types. First, we explore representation learning in combinatorial optimization and integrate a constraint-based exact solver with a predictive ML model to enhance problem-solving efficiency. We demonstrate that combining an exact solver with a predictive model that estimates optimal solution costs significantly reduces the search space and accelerates solution times. Second, we employ graph Transformer models to leverage topological and semantic node similarities in the input data, resulting in superior node representations and improved downstream task performance. Third, we empirically study the choice of model architecture for learning from tabular data. We showcase the application of tabular Transformer models to large datasets, revealing their ability to create features with high predictive power. Fourth, we utilize Transformer models for detailed user behavior modeling from time-series data, illustrating their effectiveness in capturing fine-grained patterns. Finally, we dive into the training strategy and investigate graph traversal strategies to improve device placement in deep learning model parallelization, showing that an optimized traversal order enhances parallel training speed. Collectively, these findings advance the understanding and application of representation learning and parallelization in diverse ML contexts.
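
The thesis's actual device-placement method is not described in this record; purely as a hypothetical illustration of why traversal order matters for the last point above, the sketch below walks a toy operator graph in topological order and greedily assigns each node to the least-loaded device. The graph structure, per-node costs, device count, and the greedy rule are all assumptions invented for the example.

```python
# Hypothetical sketch only: place the operators of a toy computation graph
# onto devices by visiting them in topological order and greedily picking the
# least-loaded device. Graph, costs, and device count are invented for the
# example and are not the thesis's method.
import networkx as nx


def greedy_placement(graph: nx.DiGraph, n_devices: int) -> dict:
    load = [0.0] * n_devices  # accumulated cost assigned to each device
    placement = {}
    for node in nx.topological_sort(graph):  # the traversal order is the knob
        device = min(range(n_devices), key=lambda d: load[d])
        placement[node] = device
        load[device] += graph.nodes[node].get("cost", 1.0)
    return placement


if __name__ == "__main__":
    g = nx.DiGraph()
    g.add_edges_from([("embed", "attn"), ("attn", "ffn"), ("ffn", "head")])
    nx.set_node_attributes(
        g, {"embed": 2.0, "attn": 4.0, "ffn": 3.0, "head": 1.0}, "cost"
    )
    print(greedy_placement(g, n_devices=2))
```

Changing the traversal strategy (for example, visiting the same DAG breadth-first versus depth-first) changes which operators end up co-located on a device, which is the kind of effect the abstract reports as influencing parallel training speed.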

This thesis enhances representation learning and parallelization in ML models, addressing key challenges in representation quality. Our methods advance combinatorial optimization, parallel training, and ML on graph, tabular, and time-series data. Additionally, our findings contribute to understanding Transformer models, leading to more accurate predictions and improved performance across various domains.

Abstract [sv]

Machine Learning (ML) models have achieved significant success in representation learning across domains such as computer vision, language, graphs, and tabular data. Constructing effective ML models depends on several critical considerations: (1) data representation: how the input data is represented in a meaningful and effective way; (2) learning objectives: how the desired prediction target is defined in a specific downstream task; (3) model architecture: which representation learning model architecture, i.e., the type of neural network, is the most appropriate for the given downstream task; and (4) training strategy: how to effectively train the selected ML model for better feature extraction and representation quality.

This thesis explores representation learning and parallelization in machine learning, addressing how to increase model accuracy and reduce training time. Our research investigates several innovative approaches to improving the efficiency of ML on graph, tabular, and time-series data. First, we explore representation learning in combinatorial optimization and integrate a constraint-programming-based exact solver with a predictive ML model to improve problem-solving efficiency. We show that combining an exact solver with a predictive model that estimates optimal solution costs significantly reduces the search space and solution times. Second, we use graph Transformer models to exploit topological and semantic node similarities in the input data, resulting in superior node representations and improved downstream task performance. Third, we empirically study the choice of model architecture for learning from tabular data. We showcase the application of tabular Transformer models to large datasets, revealing their ability to create features with high predictive power. Fourth, we use Transformer models for detailed modeling of user behavior from time-series data, illustrating their effectiveness in capturing fine-grained patterns. Finally, we delve into the training strategy and investigate graph traversal strategies to improve device placement in the parallelization of deep learning models, showing that an optimized traversal order improves parallel training speed. Together, these findings advance the understanding and application of representation learning and parallelization in diverse ML contexts.

This thesis improves representation learning and parallelization in ML models, addressing key challenges in representation quality. Our methods advance combinatorial optimization, parallel training, and ML on graph, tabular, and time-series data. In addition, our results contribute to the understanding of Transformer models, leading to more accurate predictions and improved performance across different domains.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2024. p. 55
Series
TRITA-EECS-AVL ; 2024:72
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-353825 (URN)
978-91-8106-051-5 (ISBN)
Public defence
2024-10-21, Sal C, Electrum, Kistagången 16, https://kth-se.zoom.us/s/63322131109, Stockholm, 09:00 (English)
Opponent
Supervisors
Available from: 2024-09-24 Created: 2024-09-24 Last updated: 2024-09-24. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Wang, Tianze
