Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
Siavashi, Mohammad. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.
Kostic, Dejan. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS. ORCID iD: 0000-0002-1256-1070
Chiesa, Marco. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS. ORCID iD: 0000-0002-9675-9729
2025 (English). In: Proceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys 2025), Association for Computing Machinery (ACM), 2025, p. 132-138. Conference paper, Published paper (Refereed)
Abstract [en]

Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an Nvidia A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of 65.5x and meets the SLO at up to 7 requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to 12.8x without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.
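
The abstract's core mechanism, preempting a running best-effort job between layers rather than only between full model iterations, can be sketched compactly. The Python sketch below is illustrative only: the names (Job, PreemptiveMoEScheduler, submit, run_layer) are hypothetical and not QLLM's actual API, and it omits everything a real MoE serving system must handle, such as batching, expert routing, and retaining the KV cache of preempted jobs.

```python
import heapq
import time
from dataclasses import dataclass, field

LS, BE = 0, 1  # lower value = higher priority (latency-sensitive vs. best-effort)

@dataclass(order=True)
class Job:
    priority: int                                       # LS=0, BE=1
    arrival: float                                      # tie-break: earlier first
    request_id: int = field(compare=False)
    next_layer: int = field(default=0, compare=False)   # resume point after preemption

class PreemptiveMoEScheduler:
    """Toy layer-granular preemptive scheduler (illustrative, not QLLM's code).

    Iteration-level FCFS commits a whole forward pass to whichever batch
    arrived first; here the loop re-checks the priority queue after every
    layer, so a newly arrived LS job can preempt a BE job mid-model.
    """

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.queue: list[Job] = []  # min-heap ordered by (priority, arrival)

    def submit(self, job: Job) -> None:
        # In a real server this is called concurrently by the request handler,
        # which is what makes mid-model preemption of BE jobs possible.
        heapq.heappush(self.queue, job)

    def run_layer(self, job: Job) -> None:
        time.sleep(0.001)  # stand-in for one (MoE) layer forward pass
        job.next_layer += 1

    def loop(self) -> None:
        while self.queue:
            job = heapq.heappop(self.queue)
            while job.next_layer < self.num_layers:
                self.run_layer(job)
                # Preemption point between layers: if a strictly
                # higher-priority job is waiting, park the current one.
                if self.queue and self.queue[0].priority < job.priority:
                    heapq.heappush(self.queue, job)  # resumes at next_layer
                    break
            else:
                print(f"request {job.request_id} done")
```

Because the preemption check runs between layers instead of between full iterations, an LS request waits for at most roughly one layer of BE work rather than an entire BE forward pass before it starts; this is the mechanism behind the TTFT reductions the abstract reports. A real implementation must also keep a preempted BE job's state (activations or KV cache) resident or offloaded so that resuming at next_layer stays cheap.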

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025. p. 132-138
Keywords [en]
Large Language Models, Mixture-of-Experts, Preemptive Scheduling, Latency-Sensitive Inference, GPU Acceleration, Priority-Aware Scheduling
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-364688
DOI: 10.1145/3721146.3721956
ISI: 001477868300014
Scopus ID: 2-s2.0-105003631563
ISBN: 979-8-4007-1538-9 (print)
OAI: oai:DiVA.org:kth-364688
DiVA id: diva2:1980009
Conference
5th Workshop on Machine Learning and Systems (EuroMLSys), March 30 - April 3, 2025, Rotterdam, Netherlands
Note

QC 20250701

Available from: 2025-07-01. Created: 2025-07-01. Last updated: 2026-01-15. Bibliographically approved.

Open Access in DiVA

fulltext (1192 kB), 9 downloads
File name: FULLTEXT01.pdf
File size: 1192 kB
Checksum (SHA-512): 547d4698cfbdc001a80627a87087841910a17a74c14521989c5328673865c9f49e1f14900544723f1aabe6f819eb78d217c4a086a16665da524457715d65eeec
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text
Scopus

Authority records

Siavashi, Mohammad; Kostic, Dejan; Chiesa, Marco
