A comparative study of the Data Warehouse and Data Lakehouse architecture
KTH, School of Electrical Engineering and Computer Science (EECS).
2024 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Alternative title
En komparativ studie av Data Warehouse- och Data Lakehouse-arkitektur (Swedish)
Abstract [en]

This thesis assessed a given Data Warehouse against a well-suited Data Lakehouse in terms of read performance and scalability. Using the TPC-DS benchmark, the systems were tested with synthetic datasets reflecting the specific needs of a Decision Support System (DSS). The research also aimed to determine whether certain categories of queries produced notably large discrepancies between the systems, which could help pinpoint the architectural differences that cause them. Initial research identified BigQuery and Delta Lake as top candidates because of their read performance and scalability, prompting further investigation into both. The largest latency difference was observed in the initial benchmark, at a dataset scale of 2 GB, with BigQuery outperforming Delta Lake. As the dataset size grew, BigQuery’s latency increased by 336%, while Delta Lake’s increased by just 40%. Even so, BigQuery maintained a significantly lower overall latency at every scale. Detailed query analysis showed that BigQuery excelled especially on complex queries, those involving extensive aggregation and multiple join operations, which have a high potential for generating large intermediate data during the shuffle stage. It was hypothesized that some of the read performance discrepancies could be attributed to BigQuery’s in-memory shuffling capability, whereas Delta Lake might spill intermediate data to disk. Delta Lake’s hardware utilization metrics further supported this hypothesis: peaks in memory usage and disk write rate coincided with the queries showing high discrepancies, while CPU utilization remained low. This pattern suggests an I/O-bound rather than a CPU-bound system, which possibly explains the observed performance differences.
Future studies are encouraged to monitor shuffle operations explicitly, aiming for a more rigorous correlation between high-discrepancy queries and data spillage during the shuffle phase. Further research should also include larger datasets; this thesis was constrained to a maximum dataset size of 64 GB due to limited resources.
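
The two measures used in the abstract, relative latency growth across dataset scales and per-query discrepancy between systems, can be illustrated with a minimal Python sketch. This is not the thesis's analysis code: the function names and all latency values below are hypothetical and serve only to show the arithmetic.

```python
def percent_increase(baseline: float, scaled: float) -> float:
    """Relative growth from `baseline` to `scaled`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return (scaled - baseline) / baseline * 100.0


def rank_by_discrepancy(
    fast: dict[str, float], slow: dict[str, float]
) -> list[tuple[str, float]]:
    """Rank queries by the latency ratio slow/fast, largest first.

    Only queries measured on both systems are compared.
    """
    ratios = {q: slow[q] / fast[q] for q in fast if q in slow}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical total latencies (seconds) at a small and a large scale.
    print(percent_increase(100.0, 436.0))  # growth of 336 percent
    print(percent_increase(100.0, 140.0))  # growth of 40 percent

    # Hypothetical per-query latencies (seconds) on two systems.
    warehouse = {"q1": 2.0, "q2": 4.0}
    lakehouse = {"q1": 10.0, "q2": 8.0}
    print(rank_by_discrepancy(warehouse, lakehouse))
```

A system can show the larger percentage growth and still have the lower absolute latency at every scale, which is the situation the abstract reports for BigQuery.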

Abstract [sv]

This thesis examined a given Data Warehouse in comparison with a suitable Data Lakehouse, focusing on read performance and scalability. Using the TPC-DS benchmark, the systems were tested with synthetic datasets that reflected the customer's specific needs. The research further aimed to determine whether certain categories of queries resulted in notably large differences between the systems, in order to identify the technological aspects of the systems that cause these differences. The initial literature study identified BigQuery and Delta Lake as top candidates because of their read performance and scalability, which led to further investigation of both. The most pronounced latency difference was observed in the initial comparison with a dataset of 2 GB, where BigQuery performed better than Delta Lake. As the data volume was scaled up, BigQuery's latency increased by 336%, while Delta Lake's increased by only 40%. However, BigQuery maintained a considerably lower total latency for all dataset sizes. Detailed analysis showed that BigQuery performed particularly well on complex queries involving extensive aggregation and multiple join operations, which have a high potential to generate large amounts of data during the shuffle phase. It was assumed that the latency differences could partly be attributed to BigQuery's in-memory shuffle capability, whereas Delta Lake risked spilling data to disk. Delta Lake's hardware utilization further supported this theory: peaks in memory usage and disk write rate coincided with queries showing large differences, while CPU utilization remained low. This pattern indicates an I/O-bound system rather than a CPU-bound one, which possibly explains the observed performance differences. Future studies are encouraged to monitor shuffle operations explicitly, with the aim of linking queries that show large differences more rigorously to data spillage during the shuffle phase.
Further research should also include larger dataset sizes; this thesis was limited to a maximum dataset size of 64 GB due to limited resources.

Place, publisher, year, edition, pages
2024. p. 46
Series
TRITA-EECS-EX ; 2024:38
Keywords [en]
Data-Intensive Computing, Data Lakehouse, BigQuery, Delta Lake, Data storage system, Data Lakehouse architecture
Keywords [sv]
Data-intensive computing, Data Lakehouse, Data Lakehouse architecture, BigQuery, Delta Lake, Data storage system
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-345638
OAI: oai:DiVA.org:kth-345638
DiVA, id: diva2:1851711
Supervisors
Examiners
Available from: 2024-05-07
Created: 2024-04-15
Last updated: 2024-05-07
Bibliographically approved

Open Access in DiVA

fulltext (4476 kB), 55 downloads
File information
File name: FULLTEXT01.pdf
File size: 4476 kB
Checksum (SHA-512): 49136930d34e5b2aae4347fb52ead4077c9cf0188edb0b9d3463ee02242b95f04ec8d6e2543ee5557b5206a37caa26ea1376519b01883149b614730d4d7a0716
Type: fulltext
Mimetype: application/pdf
