feat: vLLM proxy backend — serve VLM fine-tunes via vLLM with dynamic LoRA by hansent · Pull Request #2434 · roboflow/inference

hansent · 2026-06-10T15:37:19Z

What

New inference/models/vllm_proxy/ package: the inference server runs CPU-only in front of a vLLM container that owns the GPU and does continuous batching. Per-request auth/billing/model-resolution/preprocessing stay in the inference server unchanged — batching happens below the per-request HTTP surface, so @usage_collector and the auth middleware need zero changes.

qwen_vllm_base.py + family classes (qwen3_5, qwen3vl): Model adapters with preprocess/postprocess parity to the HF path (think-tag handling parity-tested against Qwen35HF).
adapter_manager.py: model_id → registry resolution → adapter-only download (~MBs; base weights never re-downloaded) → patch → runtime /v1/load_lora_adapter → LRU eviction. Adapter identity = model_id + package_id + content digest (package ids are not unique per version).
adapter_patch.py: deterministic transform — key remapping, vision-tower filtering, DoRA policies (reject / strip / SVD-convert; strip validated byte-exact on one production adapter class, ~0.91-similar on a denser class → per-adapter accuracy gate is the admission rule). SVD merge math verified against PEFT's own merge_and_unload.
File-authoritative base matching: registry modelVariant proved misregistered on real models (image-text/223 et al.); the adapter's own adapter_config.json base_model_name_or_path is the authority, registry variant is advisory with a misregistration WARN + patch_report record.
Observability: all vLLM calls carry X-Request-Id from the execution_id/correlation contextvars; vLLM 5xx on adapter load wraps into typed errors naming slug/model/response.
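
The adapter identity rule above (model_id + package_id + content digest) can be sketched as follows. This is a minimal illustration, not the PR's code: the function names `sha256_of_file` and `adapter_identity`, the choice of SHA-256 over the weights file, and the `:`-joined key format are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weight files are not read at once."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def adapter_identity(model_id: str, package_id: str, weights_path: Path) -> str:
    """Build a cache key that changes whenever the adapter bytes change.

    Hypothetical format: package ids are not unique per version, so the
    content digest is what distinguishes two uploads under the same ids.
    """
    digest = sha256_of_file(weights_path)
    return f"{model_id}:{package_id}:{digest[:16]}"
```

Two adapters that share a model_id and package_id but differ in content get different keys, so a re-trained adapter never reuses a stale cache entry.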

Enabled per pool via VLLM_PROXY_ENABLED (selection switch in inference/models/utils.py); zero behavior change otherwise.

Validation (cru-staging, full serverless path)

qwen3vl-2b fine-tune: 38 RPS @ c64 (baseline 1.2 RPS, ~32×), zero errors; two adapters mixed: 40.9 RPS (no multi-LoRA penalty); qwen3_5-0.8b/2b pools: 25 / 22 RPS @ c32.
Cold dynamic-LoRA path: resolve → download → strip → load in ~2.4s.
140 unit tests (CPU-only, mocked vLLM).

Companion PRs: async-serverless (pools + routing), roboflow-infra (vLLM image + DoRA lab), inference (micro-batching, separate branch).

New inference/models/vllm_proxy package: the inference server runs CPU-only in front of a vLLM sidecar (OpenAI-compatible API) that owns the GPU and does continuous batching. Per-request auth/billing/model resolution/preprocessing stay in the inference server unchanged.

- vllm_client: chat completions + runtime LoRA load/unload
- adapter_manager: model_id -> Roboflow package resolution, adapter-only download, identity = model_id + package_id + digest, LRU eviction
- adapter_patch: key remap (model.layers -> language_model path), vision-tower filtering, DoRA policies (reject/strip/svd) incl. exact-delta SVD conversion verified against peft merge math
- qwen3_5_vllm: Qwen35VLLMProxy with preprocess/postprocess parity to the HF path (think-tag parsing parity-tested against Qwen35HF)

Enabled per pool via VLLM_PROXY_ENABLED; selection switch in inference/models/utils.py.

Generalize the proxy into QwenVLLMProxyBase with family-specific knobs; add Qwen3VLVLLMProxy (no thinking mode, qwen3vl pixel budget/prompts). Adapter manager base-variant matching normalized to <architecture>-<variant minus -peft> and served-name short-circuit. Registration switch extended to qwen3vl ids behind VLLM_PROXY_ENABLED.
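
The `<architecture>-<variant minus -peft>` normalization can be illustrated with a small helper. This is a sketch under assumptions: the function name `normalize_base_variant` is hypothetical, and the input strings follow the forms quoted elsewhere in this PR (`0.8b-peft`, `qwen3_5-2b`); the real registry fields may be shaped differently.

```python
def normalize_base_variant(architecture: str, variant: str) -> str:
    """Join architecture and variant, dropping a trailing '-peft' marker.

    Hypothetical helper: mirrors the rule described in the commit message,
    not the PR's actual implementation.
    """
    suffix = "-peft"
    if variant.endswith(suffix):
        variant = variant[: -len(suffix)]
    return f"{architecture}-{variant}"
```

With this rule, a registry record of architecture `qwen3_5` and variant `0.8b-peft` normalizes to `qwen3_5-0.8b`, which can then be compared directly against the base a pool serves.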

…correlation

Real incident: registry metadata for image-text/223 says 0.8b-peft but the adapter's own config says base qwen3_5-2b, so the mislabeled adapter reached vLLM and died with an opaque tensor-shape 500. Now:

- patch_adapter cross-checks adapter_config base_model_name_or_path against the pool's served base and rejects with a message naming both values (and that the registry record is the thing to fix)
- vLLM 5xx on load_lora_adapter wraps into a typed 501 with the slug, model id, and response excerpt; connection errors stay retryable
- all vLLM API calls carry X-Request-Id from the execution_id / correlation contextvars so vLLM logs correlate with platform logs
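
The X-Request-Id propagation described here can be sketched with the standard `contextvars` module. The variable names `execution_id` and `correlation_id`, the helper `vllm_request_headers`, the precedence between the two ids, and the random fallback are all assumptions for illustration; the PR's actual contextvars live elsewhere in the codebase.

```python
import contextvars
import uuid
from typing import Dict

# Hypothetical stand-ins for the platform's execution/correlation contextvars.
execution_id = contextvars.ContextVar("execution_id", default=None)
correlation_id = contextvars.ContextVar("correlation_id", default=None)


def vllm_request_headers() -> Dict[str, str]:
    """Build headers for an outgoing vLLM call from the current context.

    Prefers the execution id, then the correlation id. The random fallback
    is an assumption so that every call still carries some id.
    """
    request_id = execution_id.get() or correlation_id.get() or uuid.uuid4().hex
    return {"X-Request-Id": request_id}
```

Because contextvars are scoped per task and per thread context, concurrent requests each see their own id without passing it through every function signature.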

…registry variant advisory

Registry modelVariant is sometimes misregistered while the adapter's own adapter_config.json is always correct (written with the weights). The manager no longer rejects on variant mismatch: it defers to the post-download cross-check against base_model_name_or_path, and logs a drift WARN when the registry disagrees (passive misregistration audit, also recorded in patch_report.json). Architecture gate stays pre-download.
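
A file-authoritative check of this kind might look like the sketch below. It assumes the declared base, the served base, and the registry value have already been normalized to the same naming scheme; `check_adapter_base`, `BaseModelMismatchError`, and the report keys are hypothetical names, not the PR's.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("vllm_proxy.sketch")


class BaseModelMismatchError(ValueError):
    """The adapter was trained on a different base than the pool serves."""


def check_adapter_base(
    adapter_dir: Path, served_base: str, registry_base: Optional[str]
) -> dict:
    """Treat adapter_config.json as the authority; the registry is advisory."""
    config = json.loads((adapter_dir / "adapter_config.json").read_text())
    declared_base = config["base_model_name_or_path"]
    drift = registry_base is not None and registry_base != declared_base
    report = {
        "declared_base": declared_base,
        "registry_base": registry_base,
        "registry_drift": drift,
    }
    if drift:
        # Advisory only: record the disagreement, do not reject on it.
        logger.warning(
            "registry says %s but adapter_config.json says %s",
            registry_base,
            declared_base,
        )
    if declared_base != served_base:
        # Hard failure: name both values so the operator can fix the record.
        raise BaseModelMismatchError(
            f"adapter base {declared_base!r} does not match served base {served_base!r}"
        )
    return report
```

The design point is the asymmetry: disagreement with the registry only produces a warning and a report entry, while disagreement with the served base is rejected before the adapter reaches vLLM.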

The vLLM pools run 64+ concurrent proxied requests per pod; the anyio default of 40 threads silently caps sync-handler concurrency below the consumer's permit count. Same implementation as on feat/vlm-dynamic-batching, so the branches merge cleanly.

… self-heal

Multi-worker correctness (NUM_WORKERS>1: one shared vLLM engine, one AdapterManager per process):

- never auto-unload adapters (vLLM's --max-cpu-loras LRU owns memory and refills from disk)
- always re-issue the idempotent load call at instance creation even when the local map says registered (it may be stale)
- on an unknown-LoRA 404 in the request path, invalidate + re-register + retry exactly once
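
The "invalidate + re-register + retry exactly once" rule can be sketched as a small wrapper. Everything named here is hypothetical: `UnknownLoRAError` stands in for however the client surfaces vLLM's unknown-LoRA 404, and the three callables stand in for the proxy's request, registration, and local-map invalidation paths.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class UnknownLoRAError(Exception):
    """Hypothetical error type for a 404 on an adapter vLLM does not know."""


def infer_with_self_heal(
    infer: Callable[[str], T],
    register: Callable[[str], None],
    invalidate: Callable[[str], None],
    slug: str,
) -> T:
    """Run a request; on an unknown-adapter error, re-register and retry once."""
    try:
        return infer(slug)
    except UnknownLoRAError:
        # The local "registered" map was stale: drop it, load again, retry.
        invalidate(slug)
        register(slug)
        # A second failure propagates to the caller; there is no loop.
        return infer(slug)
```

Capping the retry at one attempt keeps a genuinely broken adapter from turning into a request-path retry storm against the shared engine.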

# Conflicts: # inference/core/workflows/core_steps/visualizations/keypoint/v1.py # inference/models/utils.py

dkosowski87

After some discussion with an agent: at some point, a sensible improvement to this implementation would be to extract the registration path. As it stands:

Cold adapter work competes for resources with standard requests in inference, which reduces stability.
The adapter lifecycle now has multiple owners (inference and the vLLM sidecar), with no single place to handle cleanup.

Perhaps something to think about in the future.

dkosowski87 · 2026-06-22T17:32:15Z

+    patched_weights_path = os.path.join(dst_dir, ADAPTER_WEIGHTS_FILE)
+    save_file(remapped_tensors, patched_weights_path)
+    report.patched_weights_digest = _sha256_of_file(patched_weights_path)
+    with open(os.path.join(dst_dir, ADAPTER_CONFIG_FILE), "w") as f:


Another worker could load the directory while this file is still being written and get a partially written adapter dir. Let's:

write to a temporary directory and, once all the files are present, replace the cache dir with it;

lock during cache dir replace and load.
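
A minimal sketch of the staging-directory suggestion, assuming the cache dir is write-once (keyed by content, so an existing directory never needs replacing). The name `publish_adapter_dir` and the `write_files` callback are hypothetical. Under that assumption the rename itself is the publication step and no lock is taken; replacing an existing directory in place would still need the lock suggested above.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def publish_adapter_dir(final_dir: Path, write_files: Callable[[Path], None]) -> Path:
    """Write all adapter files into a staging dir, then rename it into place.

    Readers either see no directory or a complete one, never a partial one.
    The staging dir sits next to final_dir so the rename stays on one filesystem.
    """
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    staging = Path(
        tempfile.mkdtemp(prefix=final_dir.name + ".tmp-", dir=final_dir.parent)
    )
    try:
        write_files(staging)
        try:
            os.rename(staging, final_dir)
        except OSError:
            # Another worker published first; keep its copy (write-once assumption).
            if not final_dir.is_dir():
                raise
    finally:
        # No-op after a successful rename; cleans up staging otherwise.
        shutil.rmtree(staging, ignore_errors=True)
    return final_dir
```

A worker that loses the race simply discards its staging copy and uses the directory that is already published.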

dkosowski87 · 2026-06-22T17:36:13Z

+                "the trained DoRA adapter."
+            )
+        elif policy == "svd":
+            if base_dir is None:


The base_dir is pruned by the adapter_manager (lines 216:220), so it will never be available here.

hansent added 4 commits June 9, 2026 14:10

hansent requested review from PawelPeczek-Roboflow, dkosowski87, grzegorz-roboflow, probicheaux, rafel-roboflow and yeldarby as code owners June 10, 2026 15:37

hansent added 2 commits June 10, 2026 10:50

Merge remote-tracking branch 'origin/main' into feat/vllm-proxy-backend

8507504

hansent marked this pull request as draft June 10, 2026 15:58

hansent added 2 commits June 10, 2026 13:06

Merge branch 'main' into feat/vllm-proxy-backend

3186a8b

hansent marked this pull request as ready for review June 16, 2026 13:43

hansent added 4 commits June 16, 2026 08:51

Fix vLLM proxy CI checks

7a5497f

Apply code quality formatting

692a79d

Merge remote-tracking branch 'origin/main' into feat/vllm-proxy-backend

06390ab

# Conflicts: # inference/core/workflows/core_steps/visualizations/keypoint/v1.py # inference/models/utils.py

Merge remote-tracking branch 'origin/main' into feat/vllm-proxy-backend

5b4fe03

dkosowski87 added the review-auto label Jun 22, 2026

dkosowski87 reviewed Jun 22, 2026

hansent added 3 commits June 22, 2026 13:53

Fix vLLM adapter cache publication race

dd4a95b

Merge remote-tracking branch 'origin/main' into feat/vllm-proxy-backend

1e30657

Clarify runtime DoRA SVD policy

0ec4b1b

hansent wants to merge 15 commits into main from feat/vllm-proxy-backend

2 participants