Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
2025 (English) In: Proceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys 2025), Association for Computing Machinery (ACM), 2025, pp. 132-138. Conference paper, Published paper (Refereed)
Abstract [en]
Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture-of-Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an NVIDIA A100 GPU show that QLLM significantly improves performance: it reduces LS TTFT by 65.5x on average and meets the SLO at up to 7 requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to 12.8x without impacting throughput. QLLM is modular, extensible, and integrates seamlessly with Hugging Face MoE models.
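The layer-granularity preemption described in the abstract can be illustrated with a toy priority scheduler that is consulted at every layer boundary rather than once per iteration. This is a minimal sketch in our own notation (class and method names are hypothetical, not QLLM's actual implementation): an arriving LS job displaces a running BE job at the next layer boundary.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

LS, BE = 0, 1  # lower value = higher priority

@dataclass(order=True)
class Job:
    priority: int                        # LS jobs sort before BE jobs
    seq: int                             # FIFO tie-break within a priority
    name: str = field(compare=False)
    layers_left: int = field(compare=False)

class PreemptiveScheduler:
    """Toy layer-level preemptive scheduler (illustrative only).

    After each layer finishes, the highest-priority runnable job is
    re-selected, so a BE job is preempted as soon as an LS job arrives,
    instead of holding the GPU until the end of the iteration.
    """
    def __init__(self):
        self._queue = []
        self._counter = count()

    def submit(self, name, priority, layers):
        heapq.heappush(self._queue, Job(priority, next(self._counter), name, layers))

    def step(self):
        # Run one layer of the highest-priority job, then requeue it if unfinished.
        job = heapq.heappop(self._queue)
        job.layers_left -= 1
        if job.layers_left > 0:
            heapq.heappush(self._queue, job)
        return job.name

sched = PreemptiveScheduler()
sched.submit("be-1", BE, layers=3)
trace = [sched.step()]                 # BE job starts running
sched.submit("ls-1", LS, layers=2)     # LS job arrives mid-execution
while sched._queue:
    trace.append(sched.step())
# LS job preempts BE at the next layer boundary:
# trace == ['be-1', 'ls-1', 'ls-1', 'be-1', 'be-1']
```

Under iteration-level FCFS, `ls-1` would instead wait behind all remaining layers of `be-1`; the difference is precisely the head-of-line blocking that the paper targets.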
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025, pp. 132-138
Keywords [en]
Large Language Models, Mixture-of-Experts, Preemptive Scheduling, Latency-Sensitive Inference, GPU Acceleration, Priority-Aware Scheduling
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-364688
DOI: 10.1145/3721146.3721956
ISI: 001477868300014
Scopus ID: 2-s2.0-105003631563
ISBN: 979-8-4007-1538-9 (print)
OAI: oai:DiVA.org:kth-364688
DiVA, id: diva2:1980009
Conference
5th Workshop on Machine Learning and Systems (EuroMLSys), March 30-April 3, 2025, Rotterdam, Netherlands
Note
QC 20250701
Available from: 2025-07-01 Created: 2025-07-01 Last updated: 2026-01-15 Bibliographically approved