perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson) by sberan · Pull Request #2477 · roboflow/inference

sberan · 2026-06-18T02:10:09Z

What

Routes the RF-DETR numpy / uint8 preprocessing branch through the existing GPU tensor path: in pre_process_network_input, the numpy frame is uploaded to the target device as a float CHW tensor in [0, 1] and handed to _pre_process_tensor (on-device channel-swap + F.resize(antialias) + normalize), instead of the CPU PIL chain (numpy BGR→RGB reverse-copy + Pillow resize).
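To make the target layout concrete, here is a minimal host-side sketch of what "float CHW tensor in [0, 1]" with an RGB channel order means for a BGR uint8 frame. It is a NumPy reference only: the function name `bgr_uint8_to_rgb_float_chw` is hypothetical, and the PR performs the equivalent steps on the target device inside `_pre_process_tensor`, followed by resize and normalize, which are omitted here.

```python
import numpy as np


def bgr_uint8_to_rgb_float_chw(frame: np.ndarray) -> np.ndarray:
    """Host-side reference: HWC uint8 BGR -> CHW float32 RGB in [0, 1].

    Hypothetical helper for illustration; not the code from the PR.
    """
    if frame.dtype != np.uint8 or frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError("expected an HxWx3 uint8 frame")
    # HWC -> CHW, then scale 0..255 to 0..1 (result stays float32).
    chw = frame.transpose(2, 0, 1).astype(np.float32) / 255.0
    # Reverse the channel axis: BGR -> RGB.
    return chw[::-1].copy()


# A 2x3 frame that is pure blue in BGR order (channel 0 = blue).
frame = np.zeros((2, 3, 3), dtype=np.uint8)
frame[..., 0] = 255
out = bgr_uint8_to_rgb_float_chw(frame)
```

In the result, blue ends up in the last channel (`out[2]`), and all values lie in [0, 1].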

On-device measurement (AI1 / Orin NX, rfdetr): pre_process ~30 → ~6 ms/frame.

Why

On the inference-server path, the model is always fed numpy (the adapter decodes every request to BGR np.ndarray before calling pre_process), so today every served frame takes the PIL branch — the dominant per-frame cost on edge/Jetson. The float-tensor branch already exists and is exercised by predict() / direct tensor callers; this points the numpy branch at the same code.

Notes / caveats

Tensor F.resize is not byte-identical to PIL F.resize. This makes numpy serving match predict()'s tensor-path numerics rather than the old PIL numerics. Recommend an mAP check vs the PIL baseline before treating this as the default for all servers.
_pre_process_numpy is left in place (now unused by this path).
Minimal diff: one file, pre_processing.py (+17/−2).

Test plan

On-device: pre_process latency drop confirmed, detections unchanged on a stretch-mode rfdetr model.
mAP parity check (tensor vs PIL resize) before flipping the default for hosted/serverless.

The numpy/uint8 branch of pre_process_network_input used the CPU PIL chain (_pre_process_numpy): a numpy reverse-copy for BGR->RGB plus a Pillow F.resize.

Profiling the served RFDetrForObjectDetectionTRT on Jetson/Orin NX showed pre_process ~30ms (vs ~8.6ms TRT forward), dominated by:
- ~16ms numpy.ascontiguousarray(image[:, :, ::-1]) (reverse-copy for PIL)
- ~9ms torchvision -> PIL.Image.resize(antialias=True)

Upload the uint8 frame to a float CHW tensor on the target device and reuse the existing _pre_process_tensor path, so the channel-swap, F.resize(antialias) and normalize run on-device and the output is the tensor the model consumes -- no host round-trip, no CPU PIL/resize.

Measured on-device (rfdetr-nano TRT, Orin NX): pre_process 30 -> 5.8ms, end-to-end inference 20 -> 25 fps, container CPU ~229% -> 187%; predictions unchanged.

Note: numpy inputs now use tensor F.resize (same as the float-tensor input branch and predict()) rather than PIL F.resize, which is not byte-identical. _pre_process_numpy is left in place but is now unused by this path. Recommend an mAP check vs the PIL baseline before treating this as the default for all servers.
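The reverse-copy cost in the profile comes from the fact that `image[:, :, ::-1]` is only a negative-stride view, so `numpy.ascontiguousarray` has to materialise a new buffer for every frame. A small sketch showing this, using an arbitrary 480x640 stand-in frame (the frame size is an assumption, not a value from the PR):

```python
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in BGR frame
reversed_view = frame[:, :, ::-1]                # BGR -> RGB as a view, no copy yet
rgb_copy = np.ascontiguousarray(reversed_view)   # allocates and fills a fresh buffer

# The view walks the channel axis backwards and is not C-contiguous.
view_channel_stride = reversed_view.strides[2]
view_is_contiguous = reversed_view.flags.c_contiguous

# The view still aliases the original frame; the contiguous copy does not.
view_shares_memory = np.shares_memory(frame, reversed_view)
copy_shares_memory = np.shares_memory(frame, rgb_copy)
```

Doing the channel swap on the device avoids this full-frame host copy.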

Non-stretch dataset-version resizes (letterbox / center-crop / fit / letterbox-reflect) must be replayed via the PIL two-step path, which also sets nonsquare_intermediate_size that post-processing uses to map boxes back. The tensor path does a single stretch and sets neither, so gate it on not _needs_two_step_resize(); keep _pre_process_numpy for those models.
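The gating described above can be sketched as a simple dispatch. This is an illustration under stated assumptions: the mode strings, `TWO_STEP_MODES`, `needs_two_step_resize`, and `choose_numpy_preprocess_path` are hypothetical stand-ins, and the real `_needs_two_step_resize()` may take different inputs.

```python
# Hypothetical mode names taken from the commit message; the real check
# lives in _needs_two_step_resize(), whose signature is not shown here.
TWO_STEP_MODES = {"letterbox", "center-crop", "fit", "letterbox-reflect"}


def needs_two_step_resize(resize_mode: str) -> bool:
    """Stand-in for _needs_two_step_resize(): True for non-stretch modes."""
    return resize_mode in TWO_STEP_MODES


def choose_numpy_preprocess_path(resize_mode: str) -> str:
    """Pick the preprocessing path for a numpy/uint8 input."""
    if needs_two_step_resize(resize_mode):
        # Keep the PIL two-step replay (_pre_process_numpy), which also
        # records nonsquare_intermediate_size for post-processing.
        return "pil"
    # Single on-device stretch via the tensor path (_pre_process_tensor).
    return "tensor"


paths = {mode: choose_numpy_preprocess_path(mode)
         for mode in ("stretch", "letterbox", "center-crop", "fit")}
```

Only stretch-mode models take the fast tensor path; every other resize mode keeps the previous behaviour.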

PawelPeczek-Roboflow

The change violates tests assertions

sberan · 2026-06-19T13:22:18Z

The change violates tests assertions

Thanks - note that it's still in draft. It's giving a massive speedup for a simple rfdetr-nano. Will continue to evaluate.

PawelPeczek-Roboflow · 2026-06-19T13:46:38Z

yeah - sure, wanted to merge it but noticed tests issues

sberan requested review from PawelPeczek-Roboflow, dkosowski87, grzegorz-roboflow, hansent, probicheaux, rafel-roboflow and yeldarby as code owners June 18, 2026 02:10

sberan marked this pull request as draft June 18, 2026 02:24

sberan force-pushed the perf/rfdetr-preprocess-gpu-tensor branch from a1f0fce to b0a9a86 Compare June 18, 2026 03:30

sberan changed the title ~~perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson)~~ perf(rfdetr): upload numpy frames to GPU tensors at the TRT model boundary (~30→6ms/frame on Jetson) Jun 18, 2026

sberan force-pushed the perf/rfdetr-preprocess-gpu-tensor branch from b0a9a86 to a1f0fce Compare June 18, 2026 03:35

sberan changed the title ~~perf(rfdetr): upload numpy frames to GPU tensors at the TRT model boundary (~30→6ms/frame on Jetson)~~ perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson) Jun 18, 2026

PawelPeczek-Roboflow requested changes Jun 19, 2026

sberan wants to merge 2 commits into main from perf/rfdetr-preprocess-gpu-tensor
