
LLM Zeroth-Order Fine-Tuning is an Inference Workload
Researchers have identified a systems mismatch in how zeroth-order (ZO) fine-tuning of large language models is currently executed. ZO algorithms are typically run through training infrastructure, but the work shows that they are inference-dominated, because they estimate updates from forward passes rather than backpropagation, and should instead be routed through serving runtimes such as vLLM. On OPT-13B, this architectural shift cuts fine-tuning time by more than 8x, from 4.15 hours to 0.51 hours. The finding changes how practitioners should think about parameter-efficient adaptation: it collapses the boundary between inference and fine-tuning workloads and opens up efficiency gains across the LLM stack.
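
To make the "inference-dominated" claim concrete, here is a minimal sketch of one zeroth-order update. It assumes a two-point, SPSA-style estimator of the kind commonly used for ZO fine-tuning; the summary above does not say which estimator the work uses. The names `zo_sgd_step`, `quadratic_loss`, and `run_demo` are hypothetical, and a toy quadratic stands in for a model's forward pass. The point to notice is that each step needs only two loss evaluations and no backward pass.

```python
import numpy as np


def zo_sgd_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One two-point zeroth-order update built from forward evaluations only.

    Assumption: an SPSA-style estimator; not necessarily the method in the work.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)       # random perturbation direction
    loss_plus = loss_fn(params + eps * z)       # forward evaluation 1
    loss_minus = loss_fn(params - eps * z)      # forward evaluation 2
    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    return params - lr * projected_grad * z     # no backpropagation involved


def quadratic_loss(w):
    """Toy stand-in for a model forward pass: minimized at w = 3."""
    return float(np.sum((w - 3.0) ** 2))


def run_demo(steps=500, dim=4):
    """Run repeated ZO steps and return (initial_loss, final_loss)."""
    w = np.zeros(dim)
    initial = quadratic_loss(w)
    for step in range(steps):
        w = zo_sgd_step(w, quadratic_loss, seed=step)
    return initial, quadratic_loss(w)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss before: {initial:.4f}, loss after: {final:.6f}")
```

In a real setting, `loss_fn` would be a batched forward pass of the language model, which is the kind of work a serving runtime is built to schedule. For reference, the reported numbers correspond to a speedup of 4.15 / 0.51 ≈ 8.1x.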






















