arxiv:2606.12243

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Published on Jun 10

Abstract

VIA-SD introduces a multi-tier speculative decoding framework that uses intra-model routing to reduce verification costs by employing slim submodels for medium-confidence token validation, achieving significant speedups over traditional approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Speculative decoding (SD) addresses the high inference cost of LLMs by having a lightweight drafter generate candidate tokens that a large verifier validates in parallel. Existing draft-verify methods make a binary decision: accept the draft token or fully recompute it. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, without invoking the full verifier. This motivates a slim-verifier that handles tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built around a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks and requires no changes to their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/
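
The three-tier handling described in the abstract can be illustrated with a small routing sketch. This is not the authors' implementation. It assumes each draft token carries a scalar confidence in [0, 1] (for example the drafter's probability for its own token) and that two cutoffs separate the tiers. The names `DraftToken`, `route`, `route_block`, and `full_verifier_fraction`, and the threshold values `high=0.9` and `low=0.5`, are hypothetical and do not come from the paper. The sketch also routes each token independently and ignores the sequential dependency between draft positions that a real SD loop has to respect.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tier(Enum):
    ACCEPT = "accept"      # high confidence: keep the draft token as-is
    SLIM_VERIFY = "slim"   # medium confidence: hand to the slim submodel
    FULL_VERIFY = "full"   # low confidence: fall back to the full verifier


@dataclass
class DraftToken:
    token_id: int
    confidence: float  # assumed confidence score in [0, 1]


def route(token: DraftToken, high: float = 0.9, low: float = 0.5) -> Tier:
    """Assign one draft token to a verification tier.

    `high` and `low` are hypothetical thresholds, not values from the paper.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    if token.confidence >= high:
        return Tier.ACCEPT
    if token.confidence >= low:
        return Tier.SLIM_VERIFY
    return Tier.FULL_VERIFY


def route_block(tokens: List[DraftToken], high: float = 0.9,
                low: float = 0.5) -> List[Tier]:
    """Route every token in a drafted block independently."""
    return [route(t, high, low) for t in tokens]


def full_verifier_fraction(tiers: List[Tier]) -> float:
    """Share of tokens that still need the expensive full verifier."""
    if not tiers:
        return 0.0
    return sum(t is Tier.FULL_VERIFY for t in tiers) / len(tiers)


if __name__ == "__main__":
    block = [DraftToken(11, 0.97), DraftToken(42, 0.71),
             DraftToken(7, 0.55), DraftToken(3, 0.12)]
    tiers = route_block(block)
    print([t.value for t in tiers])
    print(full_verifier_fraction(tiers))
```

In a binary draft-verify scheme, every token below the acceptance cutoff would go to the full verifier. Here only tokens below `low` do, which is the mechanism the abstract credits for reducing large-model calls.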


Community

Yunqiu

Paper submitter about 4 hours ago

Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/

noahml

about 1 hour ago

Neat paper. The idea of using intra-model routing to handle those medium-confidence tokens instead of defaulting to a full recompute feels like a really intuitive way to optimize the speculative decoding pipeline. It makes a lot of sense to avoid wasting the full verifier's capacity on tokens that a slimmer submodel could handle just as well.

I’m curious how much latency overhead the routing mechanism itself adds.

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/7fe57555-053e-43e8-9dff-d6dfdd9c9b5c