feat: vLLM proxy backend — serve VLM fine-tunes via vLLM with dynamic LoRA#2434

Open
hansent wants to merge 15 commits into main from feat/vllm-proxy-backend

@hansent hansent commented Jun 10, 2026

Collaborator

What

New inference/models/vllm_proxy/ package: the inference server runs CPU-only in front of a vLLM container that owns the GPU and does continuous batching. Per-request auth/billing/model-resolution/preprocessing stay in the inference server unchanged — batching happens below the per-request HTTP surface, so @usage_collector and the auth middleware need zero changes.

  • qwen_vllm_base.py + family classes (qwen3_5, qwen3vl): Model adapters with preprocess/postprocess parity to the HF path (think-tag handling parity-tested against Qwen35HF).
  • adapter_manager.py: model_id → registry resolution → adapter-only download (~MBs; base weights never re-downloaded) → patch → runtime /v1/load_lora_adapter → LRU eviction. Adapter identity = model_id + package_id + content digest (package ids are not unique per version).
  • adapter_patch.py: deterministic transform — key remapping, vision-tower filtering, DoRA policies (reject / strip / SVD-convert; strip validated byte-exact on one production adapter class, ~0.91-similar on a denser class → per-adapter accuracy gate is the admission rule). SVD merge math verified against PEFT's own merge_and_unload.
  • File-authoritative base matching: registry modelVariant proved misregistered on real models (image-text/223 et al.); the adapter's own adapter_config.json base_model_name_or_path is the authority, registry variant is advisory with a misregistration WARN + patch_report record.
  • Observability: all vLLM calls carry X-Request-Id from the execution_id/correlation contextvars; vLLM 5xx on adapter load wraps into typed errors naming slug/model/response.
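The adapter identity rule from the adapter_manager.py bullet can be sketched as follows. This is a minimal illustration, assuming the content digest is a SHA-256 over the adapter weights file; `AdapterIdentity`, `sha256_of_file`, and the slug format are hypothetical names, not code from this branch.

```python
import hashlib
from dataclasses import dataclass


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large weight files are never fully held in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


@dataclass(frozen=True)
class AdapterIdentity:
    """Hypothetical identity triple: model_id + package_id + content digest."""

    model_id: str
    package_id: str
    digest: str

    @property
    def slug(self) -> str:
        # Assumed format: a filesystem-safe name that changes whenever the
        # weights change, even if the package id is reused across versions.
        safe_model = self.model_id.replace("/", "--")
        return f"{safe_model}--{self.package_id}--{self.digest[:12]}"
```

Because the digest is part of the identity, two versions that share a package id still map to different cache entries and different names on the vLLM side.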

Enabled per pool via VLLM_PROXY_ENABLED (selection switch in inference/models/utils.py); zero behavior change otherwise.

Validation (cru-staging, full serverless path)

  • qwen3vl-2b fine-tune: 38 RPS @ c64 (baseline 1.2 RPS, ~32×), zero errors; two adapters mixed: 40.9 RPS (no multi-LoRA penalty); qwen3_5-0.8b/2b pools: 25 / 22 RPS @ c32.
  • Cold dynamic-LoRA path: resolve → download → strip → load in ~2.4s.
  • 140 unit tests (CPU-only, mocked vLLM).

Companion PRs: async-serverless (pools + routing), roboflow-infra (vLLM image + DoRA lab), inference (micro-batching, separate branch).

hansent added 4 commits June 9, 2026 14:10
New inference/models/vllm_proxy package: the inference server runs
CPU-only in front of a vLLM sidecar (OpenAI-compatible API) that owns
the GPU and does continuous batching. Per-request auth/billing/model
resolution/preprocessing stay in the inference server unchanged.

- vllm_client: chat completions + runtime LoRA load/unload
- adapter_manager: model_id -> Roboflow package resolution, adapter-only
  download, identity = model_id + package_id + digest, LRU eviction
- adapter_patch: key remap (model.layers -> language_model path),
  vision-tower filtering, DoRA policies (reject/strip/svd) incl.
  exact-delta SVD conversion verified against peft merge math
- qwen3_5_vllm: Qwen35VLLMProxy with preprocess/postprocess parity to
  the HF path (think-tag parsing parity-tested against Qwen35HF)

Enabled per pool via VLLM_PROXY_ENABLED; selection switch in
inference/models/utils.py.
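For readers unfamiliar with the DoRA policies mentioned above, here is a rough NumPy sketch of the "svd" idea: compute the effective DoRA weight, take the exact delta against the base weight, and factor that delta back into plain LoRA matrices. It is not the code in adapter_patch.py. The function names are hypothetical, and the per-output-row norm is an assumption meant to mirror the PEFT weight layout of shape (out, in).

```python
import numpy as np


def dora_effective_weight(w0, lora_a, lora_b, magnitude, scaling):
    """Rescale each output row of (W0 + s * B @ A) to its learned magnitude."""
    v = w0 + scaling * (lora_b @ lora_a)                  # shape (out, in)
    row_norm = np.linalg.norm(v, axis=1, keepdims=True)   # assumed norm axis
    return magnitude[:, None] * v / row_norm


def lora_from_delta(delta, rank):
    """Factor delta ~= B @ A via truncated SVD (exact if rank >= rank(delta))."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    new_b = u[:, :rank] * s[:rank]   # (out, rank), singular values folded into B
    new_a = vt[:rank]                # (rank, in)
    return new_a, new_b
```

The delta `dora_effective_weight(...) - w0` is generally full rank, so an exact conversion needs `rank = min(out, in)`; any smaller rank is the best low-rank approximation rather than an exact match. The "strip" policy, by contrast, would simply drop the magnitude vector and keep the original A and B.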
Generalize the proxy into QwenVLLMProxyBase with family-specific knobs;
add Qwen3VLVLLMProxy (no thinking mode, qwen3vl pixel budget/prompts).
Adapter manager base-variant matching normalized to
<architecture>-<variant minus -peft> and served-name short-circuit.
Registration switch extended to qwen3vl ids behind VLLM_PROXY_ENABLED.
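The `<architecture>-<variant minus -peft>` normalization can be pictured like this. A minimal sketch only: `normalize_base` is a hypothetical helper, and the example values are taken from the identifiers quoted elsewhere in this PR.

```python
def normalize_base(architecture: str, variant: str) -> str:
    """Build the comparable base name: architecture plus variant without '-peft'."""
    suffix = "-peft"
    if variant.endswith(suffix):
        variant = variant[: -len(suffix)]
    return f"{architecture}-{variant}"
```

For example, `normalize_base("qwen3_5", "0.8b-peft")` yields `"qwen3_5-0.8b"`, which can then be compared directly against the base a pool serves.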
…correlation

Real incident: registry metadata for image-text/223 says 0.8b-peft but
the adapter's own config says base qwen3_5-2b — the mislabeled adapter
reached vLLM and died with an opaque tensor-shape 500. Now:
- patch_adapter cross-checks adapter_config base_model_name_or_path
  against the pool's served base and rejects with a message naming both
  values (and that the registry record is the thing to fix)
- vLLM 5xx on load_lora_adapter wraps into a typed 501 with the slug,
  model id, and response excerpt; connection errors stay retryable
- all vLLM API calls carry X-Request-Id from the execution_id /
  correlation contextvars so vLLM logs correlate with platform logs
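The request-id propagation in the last bullet could look roughly like the sketch below. The contextvar names, the helper `vllm_headers`, and the precedence of execution id over correlation id are all assumptions for illustration; only the `X-Request-Id` header name comes from the commit message.

```python
import contextvars
from typing import Dict, Optional

# Hypothetical stand-ins for the platform's existing contextvars.
execution_id = contextvars.ContextVar("execution_id", default=None)
correlation_id = contextvars.ContextVar("correlation_id", default=None)


def vllm_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return headers for a vLLM call, tagged with the current request id if any."""
    headers = dict(extra or {})
    # Assumed precedence: prefer the execution id, fall back to the correlation id.
    request_id = execution_id.get() or correlation_id.get()
    if request_id:
        headers["X-Request-Id"] = request_id
    return headers
```

Since contextvars are scoped per task or thread context, each proxied request picks up its own id without passing it through every call signature.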
…registry variant advisory

Registry modelVariant is sometimes misregistered while the adapter's
own adapter_config.json is always correct (written with the weights).
The manager no longer rejects on variant mismatch — it defers to the
post-download cross-check against base_model_name_or_path, and logs a
drift WARN when the registry disagrees (passive misregistration audit,
also recorded in patch_report.json). Architecture gate stays pre-download.
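A sketch of the file-authoritative check described in the last two commits: the adapter's own config decides, and the registry value only produces a warning. The names `check_adapter_base` and `BaseModelMismatchError` are hypothetical, and the sketch assumes `base_model_name_or_path` is directly comparable to the served base name (real values may need normalization first).

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("adapter_patch_sketch")


class BaseModelMismatchError(ValueError):
    """The adapter was trained on a different base than the pool serves."""


def check_adapter_base(
    adapter_config: Dict[str, object],
    served_base: str,
    registry_base: Optional[str] = None,
) -> Dict[str, object]:
    file_base = str(adapter_config.get("base_model_name_or_path", "")).strip()
    report = {
        "file_base": file_base,
        "served_base": served_base,
        "registry_base": registry_base,
        "registry_drift": False,
    }
    # Advisory only: a disagreeing registry record is logged, never fatal.
    if registry_base is not None and registry_base != file_base:
        report["registry_drift"] = True
        logger.warning(
            "registry says %r but adapter_config.json says %r",
            registry_base,
            file_base,
        )
    # Authoritative: the file must match what this pool actually serves.
    if file_base != served_base:
        raise BaseModelMismatchError(
            f"adapter_config.json names base {file_base!r} but this pool "
            f"serves {served_base!r}; the registry record is the thing to fix"
        )
    return report
```

With the image-text/223 values, an adapter whose config says `qwen3_5-2b` passes on a 2b pool with `registry_drift` set, and is rejected on a 0.8b pool with both names in the message.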
hansent added 2 commits June 10, 2026 10:50
The vLLM pools run 64+ concurrent proxied requests per pod; the anyio
default of 40 threads silently caps sync-handler concurrency below the
consumer's permit count. Same implementation as on
feat/vlm-dynamic-batching so the branches merge cleanly.
@hansent hansent marked this pull request as draft June 10, 2026 15:58
hansent added 2 commits June 10, 2026 13:06
… self-heal

Multi-worker correctness (NUM_WORKERS>1: one shared vLLM engine, one
AdapterManager per process): never auto-unload adapters (vLLM's
--max-cpu-loras LRU owns memory and refills from disk), always re-issue
the idempotent load call at instance creation even when the local map
says registered (it may be stale), and on an unknown-LoRA 404 in the
request path invalidate + re-register + retry exactly once.
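The request-path self-heal can be summarized as "invalidate, re-register, retry once". The sketch below uses plain callables in place of the real client and manager; `UnknownLoRAError` and `call_with_self_heal` are hypothetical names standing in for the 404 handling described above.

```python
from typing import Any, Callable


class UnknownLoRAError(Exception):
    """Stand-in for vLLM answering 404 because the adapter name is not loaded."""


def call_with_self_heal(
    send: Callable[[str, Any], Any],
    register: Callable[[str], None],
    invalidate: Callable[[str], None],
    slug: str,
    payload: Any,
) -> Any:
    """Send a request; on an unknown-LoRA error, repair local state and retry once."""
    try:
        return send(slug, payload)
    except UnknownLoRAError:
        invalidate(slug)   # drop the stale "registered" entry in this process
        register(slug)     # re-issue the idempotent load call
        # Exactly one retry: a second failure propagates to the caller.
        return send(slug, payload)
```

Bounding the retry to one attempt keeps a genuinely broken adapter from looping, while still covering the case where another worker's state or the engine's own LRU made the local map stale.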
@hansent hansent marked this pull request as ready for review June 16, 2026 13:43

@dkosowski87 dkosowski87 left a comment

Contributor

After some conversation with an agent: at some point, a sensible improvement to this implementation would be extracting the registration path. As it stands:

  • Cold adapter work competes for resources with standard requests in inference, which decreases stability.
  • The adapter lifecycle now has multiple owners (inference and the vLLM sidecar), with no single place to handle cleanup.

Something to think about in the future.

patched_weights_path = os.path.join(dst_dir, ADAPTER_WEIGHTS_FILE)
save_file(remapped_tensors, patched_weights_path)
report.patched_weights_digest = _sha256_of_file(patched_weights_path)
with open(os.path.join(dst_dir, ADAPTER_CONFIG_FILE), "w") as f:

Contributor

Another worker could load the directory while this file is being written and get a partially written adapter dir. Let's:

  • write to a temporary directory and, once all the files are present, replace the cache dir with it.
  • hold a lock during both the cache dir replace and the load.
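Something along these lines, as a rough sketch of the suggestion rather than a drop-in patch. It assumes a POSIX host (`fcntl.flock`) and a shared cache root; `cache_lock` and `publish_adapter_dir` are hypothetical names.

```python
import contextlib
import fcntl
import os
import shutil
import tempfile
from typing import Mapping


@contextlib.contextmanager
def cache_lock(cache_root: str):
    """Exclusive advisory lock shared by writers (replace) and readers (load)."""
    os.makedirs(cache_root, exist_ok=True)
    with open(os.path.join(cache_root, ".lock"), "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def publish_adapter_dir(cache_root: str, name: str, files: Mapping[str, bytes]) -> str:
    """Write all files to a temp dir, then swap it into place under the lock."""
    os.makedirs(cache_root, exist_ok=True)
    final_dir = os.path.join(cache_root, name)
    # Same parent directory, so the final rename stays on one filesystem.
    tmp_dir = tempfile.mkdtemp(prefix=name + ".tmp-", dir=cache_root)
    try:
        for fname, data in files.items():
            with open(os.path.join(tmp_dir, fname), "wb") as f:
                f.write(data)
        with cache_lock(cache_root):
            if os.path.isdir(final_dir):
                shutil.rmtree(final_dir)
            os.rename(tmp_dir, final_dir)
    finally:
        # Only still present if something failed before the rename.
        if os.path.isdir(tmp_dir):
            shutil.rmtree(tmp_dir, ignore_errors=True)
    return final_dir
```

The loader would take `cache_lock(cache_root)` around its read of the directory, so it sees either the old complete dir or the new complete dir, never a half-written one.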

"the trained DoRA adapter."
)
elif policy == "svd":
if base_dir is None:

Contributor

The adapter_manager prunes base_dir (lines 216:220), so it will never be available here.

2 participants