perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson)#2477

Draft
sberan wants to merge 2 commits into main from perf/rfdetr-preprocess-gpu-tensor

sberan commented Jun 18, 2026

Contributor

What

Routes the RF-DETR numpy / uint8 preprocessing branch through the existing GPU tensor path: in pre_process_network_input, the numpy frame is uploaded to the target device as a float CHW tensor in [0, 1] and handed to _pre_process_tensor (on-device channel-swap + F.resize(antialias) + normalize), instead of the CPU PIL chain (numpy BGR→RGB reverse-copy + Pillow resize).

On-device measurement (AI1 / Orin NX, rfdetr): pre_process ~30 → ~6 ms/frame.
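
The upload-and-reuse flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `numpy_bgr_to_model_input` is a hypothetical name, `torch.nn.functional.interpolate(..., antialias=True)` stands in for the torchvision `F.resize` the description mentions, and the 384×384 target size and ImageNet mean/std in the usage lines are placeholders. The sketch uploads the uint8 frame first and converts to float on the device, which is one possible ordering.

```python
import numpy as np
import torch
import torch.nn.functional as F


def numpy_bgr_to_model_input(frame_bgr, target_hw, mean, std, device):
    """Hypothetical stand-in for the numpy -> tensor-path handoff."""
    # HWC uint8 on the host -> CHW float in [0, 1] on the target device.
    # torch.from_numpy needs a positively-strided array; a decoded frame is.
    chw = torch.from_numpy(frame_bgr).to(device).permute(2, 0, 1).float().div(255.0)
    # BGR -> RGB as an on-device channel flip (no host-side reverse-copy).
    rgb = chw.flip(0)
    # Single stretch resize with antialiasing (stand-in for F.resize).
    resized = F.interpolate(
        rgb.unsqueeze(0), size=target_hw, mode="bilinear",
        antialias=True, align_corners=False,
    )
    # Per-channel normalization, still on-device.
    mean_t = torch.tensor(mean, device=device).view(1, 3, 1, 1)
    std_t = torch.tensor(std, device=device).view(1, 3, 1, 1)
    return (resized - mean_t) / std_t


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
batch = numpy_bgr_to_model_input(
    frame, (384, 384),
    mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225),  # assumed values
    device=device,
)
print(batch.shape, batch.dtype, batch.device)
```

The returned tensor stays on `device`, so it can be passed to the model without a host round-trip.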

Why

On the inference-server path, the model is always fed numpy (the adapter decodes every request to BGR np.ndarray before calling pre_process), so today every served frame takes the PIL branch — the dominant per-frame cost on edge/Jetson. The float-tensor branch already exists and is exercised by predict() / direct tensor callers; this points the numpy branch at the same code.

Notes / caveats

  • Tensor F.resize is not byte-identical to PIL F.resize. This makes numpy serving match predict()'s tensor-path numerics rather than the old PIL numerics. Recommend an mAP check vs the PIL baseline before treating this as the default for all servers.
  • _pre_process_numpy is left in place (now unused by this path).
  • Minimal diff: one file, pre_processing.py (+17/−2).

Test plan

  • On-device: pre_process latency drop confirmed, detections unchanged on a stretch-mode rfdetr model.
  • mAP parity check (tensor vs PIL resize) before flipping the default for hosted/serverless.

The numpy/uint8 branch of pre_process_network_input used the CPU PIL chain
(_pre_process_numpy): a numpy reverse-copy for BGR->RGB plus a Pillow F.resize.
Profiling the served RFDetrForObjectDetectionTRT on Jetson/Orin NX showed
pre_process ~30ms (vs ~8.6ms TRT forward), dominated by:
  - ~16ms  numpy.ascontiguousarray(image[:, :, ::-1])   (reverse-copy for PIL)
  - ~9ms   torchvision -> PIL.Image.resize(antialias=True)

Upload the uint8 frame to a float CHW tensor on the target device and reuse the
existing _pre_process_tensor path, so the channel-swap, F.resize(antialias) and
normalize run on-device and the output is the tensor the model consumes -- no
host round-trip, no CPU PIL/resize.

Measured on-device (rfdetr-nano TRT, Orin NX): pre_process 30 -> 5.8ms, end-to-end
inference 20 -> 25 fps, container CPU ~229% -> 187%; predictions unchanged.

Note: numpy inputs now use tensor F.resize (same as the float-tensor input branch
and predict()) rather than PIL F.resize, which is not byte-identical. _pre_process_numpy
is left in place but is now unused by this path. Recommend an mAP check vs the PIL
baseline before treating this as the default for all servers.
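
For reference, the reported figures can be put side by side with a few lines of arithmetic. The sketch below only restates the numbers quoted in this commit message (30 → 5.8 ms pre_process, 20 → 25 fps); the variable names are illustrative. The message does not say how pipeline stages overlap, so the difference between the pre_process saving and the frame-period saving is shown without further interpretation.

```python
# Figures quoted in the commit message (rfdetr-nano TRT, Orin NX).
PRE_MS_BEFORE, PRE_MS_AFTER = 30.0, 5.8
FPS_BEFORE, FPS_AFTER = 20.0, 25.0


def frame_period_ms(fps: float) -> float:
    """Average wall-clock time per frame implied by a frame rate."""
    return 1000.0 / fps


pre_saved_ms = PRE_MS_BEFORE - PRE_MS_AFTER
pre_speedup = PRE_MS_BEFORE / PRE_MS_AFTER
period_saved_ms = frame_period_ms(FPS_BEFORE) - frame_period_ms(FPS_AFTER)

print(f"pre_process saving:  {pre_saved_ms:.1f} ms ({pre_speedup:.1f}x faster)")
print(f"frame-period saving: {period_saved_ms:.1f} ms "
      f"({frame_period_ms(FPS_BEFORE):.0f} -> {frame_period_ms(FPS_AFTER):.0f} ms)")
```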

sberan marked this pull request as draft June 18, 2026 02:24
sberan force-pushed the perf/rfdetr-preprocess-gpu-tensor branch from a1f0fce to b0a9a86 June 18, 2026 03:30
sberan changed the title from "perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson)" to "perf(rfdetr): upload numpy frames to GPU tensors at the TRT model boundary (~30→6ms/frame on Jetson)" Jun 18, 2026
sberan force-pushed the perf/rfdetr-preprocess-gpu-tensor branch from b0a9a86 to a1f0fce June 18, 2026 03:35
sberan changed the title from "perf(rfdetr): upload numpy frames to GPU tensors at the TRT model boundary (~30→6ms/frame on Jetson)" to "perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson)" Jun 18, 2026

Non-stretch dataset-version resizes (letterbox / center-crop / fit /
letterbox-reflect) must be replayed via the PIL two-step path, which also
sets nonsquare_intermediate_size, the value post-processing uses to map boxes back.
The tensor path does a single stretch and does neither (no two-step replay, no
nonsquare_intermediate_size), so gate it on not _needs_two_step_resize() and keep
_pre_process_numpy for models that need the two-step resize.
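
The gating described above could look roughly like the sketch below. It is a hypothetical illustration: the mode strings are taken from the list in this commit message, while `needs_two_step_resize` and `select_numpy_preprocess_path` are stand-in names whose logic is assumed, not copied from `pre_processing.py`.

```python
# Resize modes the commit message says must be replayed via the PIL path.
# The exact strings used in the codebase are an assumption.
TWO_STEP_MODES = {"letterbox", "center-crop", "fit", "letterbox-reflect"}


def needs_two_step_resize(resize_mode: str) -> bool:
    """Assumed stand-in for _needs_two_step_resize()."""
    return resize_mode in TWO_STEP_MODES


def select_numpy_preprocess_path(resize_mode: str) -> str:
    """Pick which branch a numpy frame should take."""
    if needs_two_step_resize(resize_mode):
        # PIL two-step path: also records nonsquare_intermediate_size,
        # which post-processing needs to map boxes back.
        return "pil_two_step"
    # Plain stretch: the single-resize GPU tensor path is sufficient.
    return "gpu_tensor"


for mode in ("stretch", "letterbox", "center-crop"):
    print(mode, "->", select_numpy_preprocess_path(mode))
```

Keeping the decision in one predicate means stretch-mode models get the faster path while every other mode keeps its existing behavior.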

PawelPeczek-Roboflow left a comment

Collaborator

The change violates test assertions

sberan commented Jun 19, 2026

Contributor Author

> The change violates test assertions

Thanks - note that it's still in draft. It's giving a massive speedup for a simple rfdetr-nano. Will continue to evaluate.

@PawelPeczek-Roboflow

Collaborator

yeah - sure, wanted to merge it but noticed test issues
