feat: vLLM proxy backend — serve VLM fine-tunes via vLLM with dynamic LoRA#2434
Open
hansent wants to merge 15 commits into
New inference/models/vllm_proxy package: the inference server runs CPU-only in front of a vLLM sidecar (OpenAI-compatible API) that owns the GPU and does continuous batching. Per-request auth/billing/model resolution/preprocessing stay in the inference server unchanged.
- vllm_client: chat completions + runtime LoRA load/unload
- adapter_manager: model_id -> Roboflow package resolution, adapter-only download, identity = model_id + package_id + digest, LRU eviction
- adapter_patch: key remap (model.layers -> language_model path), vision-tower filtering, DoRA policies (reject/strip/svd) incl. exact-delta SVD conversion verified against peft merge math
- qwen3_5_vllm: Qwen35VLLMProxy with preprocess/postprocess parity to the HF path (think-tag parsing parity-tested against Qwen35HF)
Enabled per pool via VLLM_PROXY_ENABLED; selection switch in inference/models/utils.py.
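The adapter identity described for adapter_manager (model_id + package_id + digest) could be modelled roughly as below. This is a minimal sketch, not the PR's code: the `AdapterIdentity` class, its `slug()` method and the 16-character slug length are hypothetical, and `sha256_of_file` only guesses at what the `_sha256_of_file` helper seen in the diff might do.

```python
import hashlib
from dataclasses import dataclass


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class AdapterIdentity:
    """Hypothetical identity record: all three fields must match."""

    model_id: str
    package_id: str
    digest: str  # content digest of the adapter weights

    def slug(self) -> str:
        # Assumption: a short, stable name derived from all three fields,
        # usable as the LoRA name registered with vLLM.
        raw = "\n".join((self.model_id, self.package_id, self.digest))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Including the digest means two uploads that share a model_id and package_id but carry different weights never collide in a cache keyed on this identity.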
Generalize the proxy into QwenVLLMProxyBase with family-specific knobs; add Qwen3VLVLLMProxy (no thinking mode, qwen3vl pixel budget/prompts). Adapter manager base-variant matching normalized to <architecture>-<variant minus -peft> and served-name short-circuit. Registration switch extended to qwen3vl ids behind VLLM_PROXY_ENABLED.
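The base-variant normalization mentioned above (`<architecture>-<variant minus -peft>`) can be illustrated with a few lines. The function name `normalize_base_variant` is an assumption made for this sketch; only the naming rule itself comes from the commit message.

```python
def normalize_base_variant(architecture: str, variant: str) -> str:
    """Build '<architecture>-<variant>' with a trailing '-peft' removed."""
    suffix = "-peft"
    if variant.endswith(suffix):
        variant = variant[: -len(suffix)]
    return f"{architecture}-{variant}"


# Example: a registry record with architecture "qwen3_5" and variant
# "0.8b-peft" normalizes to the base name "qwen3_5-0.8b".
print(normalize_base_variant("qwen3_5", "0.8b-peft"))
```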
…correlation
Real incident: registry metadata for image-text/223 says 0.8b-peft but the adapter's own config says base qwen3_5-2b. The mislabeled adapter reached vLLM and died with an opaque tensor-shape 500. Now:
- patch_adapter cross-checks adapter_config base_model_name_or_path against the pool's served base and rejects with a message naming both values (and that the registry record is the thing to fix)
- vLLM 5xx on load_lora_adapter wraps into a typed 501 with the slug, model id, and response excerpt; connection errors stay retryable
- all vLLM API calls carry X-Request-Id from the execution_id / correlation contextvars so vLLM logs correlate with platform logs
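A cross-check of this kind might look like the sketch below. It assumes the adapter directory contains a PEFT-style adapter_config.json with a `base_model_name_or_path` field, as the commit message states. The error class, the function name, and the rule of comparing only the last path component case-insensitively are all hypothetical.

```python
import json
import os


class AdapterBaseMismatchError(ValueError):
    """Hypothetical error for an adapter trained against a different base."""


def _short_name(name: str) -> str:
    # Assumption: compare only the last path component, case-insensitively.
    return name.rstrip("/").split("/")[-1].lower()


def check_adapter_base(adapter_dir: str, served_base: str) -> str:
    """Return the declared base, or raise with both values in the message."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, encoding="utf-8") as f:
        declared = json.load(f).get("base_model_name_or_path") or ""
    if _short_name(declared) != _short_name(served_base):
        raise AdapterBaseMismatchError(
            f"adapter declares base {declared!r} but this pool serves "
            f"{served_base!r}; fix the registry record for this model"
        )
    return declared
```

Failing here, before the adapter reaches vLLM, replaces the opaque tensor-shape error with a message that names both bases.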
…registry variant advisory
Registry modelVariant is sometimes misregistered, while the adapter's own adapter_config.json is always correct (written with the weights). The manager no longer rejects on variant mismatch: it defers to the post-download cross-check against base_model_name_or_path and logs a drift WARN when the registry disagrees (passive misregistration audit, also recorded in patch_report.json). Architecture gate stays pre-download.
The vLLM pools run 64+ concurrent proxied requests per pod; the anyio default of 40 threads silently caps sync-handler concurrency below the consumer's permit count. Same implementation as on feat/vlm-dynamic-batching so the branches merge cleanly.
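The capping effect can be reproduced with the standard library alone. In the sketch below a `ThreadPoolExecutor` stands in for anyio's worker-thread limiter (this is a substitution for illustration, not the PR's change), and `peak_concurrency` is a hypothetical helper that reports how many blocking handlers ever ran at the same time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(pool_size: int, permits: int, hold_s: float = 0.05) -> int:
    """Run `permits` blocking tasks on `pool_size` threads; return the peak overlap."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def handler() -> None:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(hold_s)  # stands in for a blocking proxied call to vLLM
        with lock:
            running -= 1

    # Leaving the `with` block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(permits):
            pool.submit(handler)
    return peak
```

With `pool_size=40` and `permits=64` the returned peak can never exceed 40, however many permits the consumer hands out; sizing the pool to at least the permit count removes that ceiling.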
… self-heal
Multi-worker correctness (NUM_WORKERS>1: one shared vLLM engine, one AdapterManager per process):
- never auto-unload adapters (vLLM's --max-cpu-loras LRU owns memory and refills from disk)
- always re-issue the idempotent load call at instance creation, even when the local map says registered (it may be stale)
- on an unknown-LoRA 404 in the request path, invalidate + re-register + retry exactly once
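The self-heal rule can be sketched as follows. Everything named here is hypothetical: `UnknownLoRAError` stands for whatever the client raises on the unknown-LoRA 404, and `AdapterRegistrar` is a simplified stand-in for the per-process manager, with the vLLM load call injected as a plain callable.

```python
from typing import Callable, Set, TypeVar

T = TypeVar("T")


class UnknownLoRAError(Exception):
    """Hypothetical: vLLM answered 404 for an adapter name it does not know."""


class AdapterRegistrar:
    def __init__(self, load_adapter: Callable[[str], None]) -> None:
        self._load_adapter = load_adapter  # assumed idempotent on the vLLM side
        self._registered: Set[str] = set()  # local view only; may be stale

    def ensure_registered(self, name: str) -> None:
        # Always re-issue the load, even if the local map says registered.
        self._load_adapter(name)
        self._registered.add(name)

    def invalidate(self, name: str) -> None:
        self._registered.discard(name)

    def call_with_self_heal(self, name: str, request: Callable[[str], T]) -> T:
        try:
            return request(name)
        except UnknownLoRAError:
            # The shared engine dropped the adapter: re-register, retry once.
            self.invalidate(name)
            self.ensure_registered(name)
            return request(name)  # a second failure propagates to the caller
```

Retrying exactly once bounds the extra latency and avoids a loop when the adapter genuinely cannot be loaded.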
# Conflicts:
#	inference/core/workflows/core_steps/visualizations/keypoint/v1.py
#	inference/models/utils.py
dkosowski87 (Contributor) reviewed Jun 22, 2026
After some conversation with an agent: at some point, a sensible improvement to this implementation would be extracting the registration path. Right now:
- Cold adapter work competes for resources with standard requests in inference, decreasing stability.
- The adapter lifecycle now has multiple owners (inference and the vLLM sidecar), with no single place to handle cleanup.
Perhaps something to think about in the future.
patched_weights_path = os.path.join(dst_dir, ADAPTER_WEIGHTS_FILE)
save_file(remapped_tensors, patched_weights_path)
report.patched_weights_digest = _sha256_of_file(patched_weights_path)
with open(os.path.join(dst_dir, ADAPTER_CONFIG_FILE), "w") as f:
Another worker could load the directory while this file is being written and get a partially written adapter dir. Let's:
- write to a temporary directory and, once all the files are present, replace the cache dir with it
- lock during cache dir replace and load
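One way to realise this suggestion is sketched below. It is only an illustration under stated assumptions: `publish_adapter_dir` and `dir_lock` are hypothetical names, the lock is an advisory `fcntl.flock` (POSIX only) on a sibling `.lock` file, and the staging directory is created next to the cache dir so the final rename stays on one filesystem.

```python
import contextlib
import fcntl  # POSIX only
import os
import shutil
import tempfile
from typing import Callable, Iterator


@contextlib.contextmanager
def dir_lock(cache_dir: str) -> Iterator[None]:
    """Advisory cross-process lock; loaders should hold it while reading."""
    lock_path = os.path.abspath(cache_dir).rstrip(os.sep) + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def publish_adapter_dir(cache_dir: str, write_files: Callable[[str], None]) -> None:
    """Write all files into a staging dir, then swap it in under the lock."""
    parent = os.path.dirname(os.path.abspath(cache_dir))
    os.makedirs(parent, exist_ok=True)
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        write_files(staging)  # weights, config, report: all written here first
        with dir_lock(cache_dir):
            if os.path.isdir(cache_dir):
                shutil.rmtree(cache_dir)
            os.rename(staging, cache_dir)  # same filesystem, so a single rename
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

A loader that wraps its read of the cache dir in `dir_lock(cache_dir)` then sees either the previous complete directory or the new one, never a half-written mix.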
        "the trained DoRA adapter."
    )
elif policy == "svd":
    if base_dir is None:
The base_dir is pruned by the adapter_manager (lines 216:220), so it will never be available here.
What
New inference/models/vllm_proxy/ package: the inference server runs CPU-only in front of a vLLM container that owns the GPU and does continuous batching. Per-request auth/billing/model-resolution/preprocessing stay in the inference server unchanged: batching happens below the per-request HTTP surface, so @usage_collector and the auth middleware need zero changes.
- qwen_vllm_base.py + family classes (qwen3_5, qwen3vl): model adapters with preprocess/postprocess parity to the HF path (think-tag handling parity-tested against Qwen35HF).
- adapter_manager.py: model_id → registry resolution → adapter-only download (~MBs; base weights never re-downloaded) → patch → runtime /v1/load_lora_adapter → LRU eviction. Adapter identity = model_id + package_id + content digest (package ids are not unique per version).
- adapter_patch.py: deterministic transform covering key remapping, vision-tower filtering, and DoRA policies (reject / strip / SVD-convert; strip validated byte-exact on one production adapter class, ~0.91-similar on a denser class → per-adapter accuracy gate is the admission rule). SVD merge math verified against PEFT's own merge_and_unload.
- modelVariant proved misregistered on real models (image-text/223 et al.); the adapter's own adapter_config.json base_model_name_or_path is the authority, registry variant is advisory with a misregistration WARN + patch_report record.
- X-Request-Id from the execution_id/correlation contextvars; vLLM 5xx on adapter load wraps into typed errors naming slug/model/response.
Enabled per pool via VLLM_PROXY_ENABLED (selection switch in inference/models/utils.py); zero behavior change otherwise.
Validation (cru-staging, full serverless path)
Companion PRs: async-serverless (pools + routing), roboflow-infra (vLLM image + DoRA lab), inference (micro-batching, separate branch).