kth.sePublications
Highly Available Task Scheduling in Distinctly Branched Directed Acyclic Graphs
KTH, School of Electrical Engineering and Computer Science (EECS).
2023 (English)
Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Högt tillgänglig schemaläggning av uppgifter i distinkt grenade riktade acykliska grafer (Swedish)
Abstract [en]

Distributed frameworks that parallelize computation over large datasets have become a staple of data engineering and data science pipelines. One of the better-known examples is Dask, a widely used framework for parallelizing data processing jobs. In Dask, the scheduler is the main component that traverses the task graph and plans the execution of a job. Dask schedules centrally, with a single server node acting as the scheduler. Because no failover mechanism is implemented for the scheduler, work in progress may be lost if the scheduler fails, and jobs that have been running for hours or longer must then be restarted. In this thesis, we design a highly available scheduler based on Dask that replicates the state of the job to a distributed key-value store. With replicated schedulers, another scheduler can take over the job if one fails. To reduce the performance overhead of replication, we further explore optimizations that partition typical task graphs and send each partition to its own scheduler. The results show that the replicated scheduler tolerates server failures and completes the job without restarting, but at the cost of reduced throughput due to replication. Our partitioning mitigates this cost: by scheduling partitions in parallel, it achieves almost linear performance gains relative to our baseline fault-tolerant scheduler.
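The replication and takeover idea in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation and does not use Dask's API: `KVStore` is a hypothetical in-memory stand-in for a distributed key-value store, and `ReplicatedScheduler`, its `run` method, and the `fail_after` parameter are names invented here for illustration. The sketch assumes that writing each finished task's result to the store is enough for a standby scheduler to resume the job.

```python
from graphlib import TopologicalSorter


class KVStore:
    """Hypothetical stand-in for a distributed key-value store (a plain dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def contains(self, key):
        return key in self._data


class ReplicatedScheduler:
    """Illustrative scheduler that records every finished task in the store."""

    def __init__(self, name, graph, funcs, store):
        self.name = name
        self.graph = graph    # task -> ordered tuple of dependencies
        self.funcs = funcs    # task -> callable taking the dependency results
        self.store = store

    def run(self, fail_after=None):
        """Run tasks in dependency order; optionally simulate a crash."""
        ran = []
        for task in TopologicalSorter(self.graph).static_order():
            if self.store.contains(task):
                continue  # already completed before a failover
            if fail_after is not None and len(ran) == fail_after:
                raise RuntimeError(f"{self.name} crashed")
            deps = [self.store.get(dep) for dep in self.graph[task]]
            self.store.put(task, self.funcs[task](*deps))  # replicate state
            ran.append(task)
        return ran


graph = {
    "load": (),
    "clean": ("load",),
    "stats": ("clean",),
    "report": ("clean", "stats"),
}
funcs = {
    "load": lambda: [3, 1, 2],
    "clean": lambda xs: sorted(xs),
    "stats": lambda xs: sum(xs),
    "report": lambda xs, total: (xs, total),
}

store = KVStore()
try:
    ReplicatedScheduler("primary", graph, funcs, store).run(fail_after=2)
except RuntimeError:
    pass  # the primary failed after finishing "load" and "clean"

# The standby reads the replicated state and only runs the remaining tasks.
resumed = ReplicatedScheduler("standby", graph, funcs, store).run()
print(resumed)              # ['stats', 'report']
print(store.get("report"))  # ([1, 2, 3], 6)
```

The write to the store on every task completion is also where the throughput cost mentioned in the abstract would come from in a real deployment, since each write would cross the network.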

Abstract [sv] (translated to English)

Frameworks for processing large datasets have become an important part of data engineering and data science pipelines. One of the better-known frameworks is Dask, which is used to parallelize data processing jobs. One of the main components of Dask is its scheduler, which traverses and plans the execution of work. Dask uses centralized scheduling, with a single server node as the scheduler. Without an implemented failure-handling mechanism, all work is lost if the scheduler crashes. In this thesis, we create a scheduler based on Dask. We introduce high availability to the scheduler by replicating the state of a job to a distributed key-value store. To reduce the cost of replication, we explore optimizations that partition typical task graphs and then send each partition to its own scheduler. The results show that a replicated scheduler tolerates crashes of the scheduling servers and can complete a job without restarting, at the cost of reduced scheduling efficiency due to replication. This loss of efficiency is mitigated by our partitioning strategy, which, by using a parallelized scheduling architecture, achieves almost linear performance gains compared with the simple fault-tolerant scheduler.
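The partitioning step can be sketched in a similar spirit. The abstract does not say how partitions are chosen, so the sketch below makes an assumption: a "distinct branch" is taken to be a weakly connected component of the task graph, meaning a group of tasks that shares no dependencies with any other group. The function name `partition_independent_branches` is hypothetical, and the graph format (task name mapped to a tuple of dependencies) is chosen here for illustration rather than taken from Dask.

```python
from collections import defaultdict


def partition_independent_branches(graph):
    """Split a task graph into weakly connected components.

    `graph` maps each task to an iterable of its dependencies. Each returned
    partition is a smaller graph in the same format that could, under the
    assumption stated above, be handed to its own scheduler.
    """
    # Build an undirected view so that direction does not matter for grouping.
    neighbours = defaultdict(set)
    for task, deps in graph.items():
        neighbours[task]  # make sure tasks without edges still appear
        for dep in deps:
            neighbours[task].add(dep)
            neighbours[dep].add(task)

    seen = set()
    partitions = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        partitions.append({t: tuple(graph.get(t, ())) for t in sorted(component)})
    return partitions


example = {
    "a1": (),
    "a2": ("a1",),
    "b1": (),
    "b2": ("b1",),
    "b3": ("b1", "b2"),
}
for part in partition_independent_branches(example):
    print(sorted(part))
# ['a1', 'a2']
# ['b1', 'b2', 'b3']
```

Because the two partitions share no tasks, schedulers working on them would not need to coordinate, which is one plausible reading of how parallel scheduling could approach linear gains.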

Place, publisher, year, edition, pages
2023. p. 53
Series
TRITA-EECS-EX ; 2023:702
Keywords [en]
Distributed Scheduling, Fault-tolerance, Graph Partitioning, Task Graphs, Dask, Dask Distributed, Data Processing
Keywords [sv]
Distribuerad Schemaläggning, Feltolerans, Grafpartitionering, Uppgiftsgrafer, Dask, Dask Distributed, Dataprocessering
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-338090
OAI: oai:DiVA.org:kth-338090
DiVA, id: diva2:1804922
External cooperation
Rebase
Supervisors
Examiners
Available from: 2023-11-02
Created: 2023-10-13
Last updated: 2023-11-02
Bibliographically approved

Open Access in DiVA

fulltext (719 kB), 105 downloads
File information
File name: FULLTEXT01.pdf
File size: 719 kB
Checksum (SHA-512): 13f05d9134fd6517c477d2897ce5147f24d31e749257398ed0457d2779b5d0c7e88655c13087919683de148bd1cf28da56c6896fb6b35f47253ce563da60e637
Type: fulltext
Mimetype: application/pdf

Total: 105 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 345 hits