perf(rfdetr): route numpy preprocessing through the GPU tensor path (~30→6ms/frame on Jetson)#2477
Draft
sberan wants to merge 2 commits into
The numpy/uint8 branch of pre_process_network_input used the CPU PIL chain (_pre_process_numpy): a numpy reverse-copy for BGR->RGB plus a Pillow F.resize.

Profiling the served RFDetrForObjectDetectionTRT on Jetson/Orin NX showed pre_process ~30ms (vs ~8.6ms TRT forward), dominated by:
- ~16ms numpy.ascontiguousarray(image[:, :, ::-1]) (reverse-copy for PIL)
- ~9ms torchvision -> PIL.Image.resize(antialias=True)

Upload the uint8 frame to a float CHW tensor on the target device and reuse the existing _pre_process_tensor path, so the channel-swap, F.resize(antialias) and normalize run on-device and the output is the tensor the model consumes -- no host round-trip, no CPU PIL/resize.

Measured on-device (rfdetr-nano TRT, Orin NX): pre_process 30 -> 5.8ms, end-to-end inference 20 -> 25 fps, container CPU ~229% -> 187%; predictions unchanged.

Note: numpy inputs now use tensor F.resize (same as the float-tensor input branch and predict()) rather than PIL F.resize, which is not byte-identical. _pre_process_numpy is left in place but is now unused by this path. Recommend an mAP check vs the PIL baseline before treating this as the default for all servers.
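For readers who want to see the shape of the change, the following is a minimal sketch of the upload-then-process idea described above, not the code in this PR. The function name `numpy_bgr_to_model_input`, the target size, and the mean/std values are hypothetical, and `torch.nn.functional.interpolate(..., antialias=True)` stands in for the torchvision `F.resize(antialias)` call that `_pre_process_tensor` is described as using.

```python
import numpy as np
import torch
import torch.nn.functional as F_nn


def numpy_bgr_to_model_input(image_bgr, size, mean, std, device="cpu"):
    """Hypothetical helper: uint8 HWC BGR frame -> normalized 1xCxHxW RGB tensor."""
    # Copy the raw uint8 bytes to the target device first (no host-side reverse-copy).
    frame = torch.from_numpy(image_bgr).to(device)
    # HWC -> 1xCxHxW, converted to float in [0, 1].
    chw = frame.permute(2, 0, 1).unsqueeze(0).float() / 255.0
    # BGR -> RGB by indexing the channel axis on the device.
    rgb = chw[:, [2, 1, 0], :, :]
    # Single stretch resize with antialiasing (stand-in for F.resize(antialias)).
    resized = F_nn.interpolate(
        rgb, size=size, mode="bilinear", antialias=True, align_corners=False
    )
    # Per-channel normalization, also on the device.
    mean_t = torch.tensor(mean, dtype=torch.float32, device=device).view(1, 3, 1, 1)
    std_t = torch.tensor(std, dtype=torch.float32, device=device).view(1, 3, 1, 1)
    return (resized - mean_t) / std_t


if __name__ == "__main__":
    # Assumed values for illustration only: a 480x640 frame, 384x384 network
    # input, and ImageNet mean/std.
    frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    out = numpy_bgr_to_model_input(
        frame, (384, 384), (0.485, 0.456, 0.406), (0.229, 0.224, 0.225), device
    )
    print(out.shape, out.dtype, out.device)
```

The point of the ordering is that only the compact uint8 buffer crosses the host/device boundary; the float conversion, channel swap, resize and normalize then happen where the model input has to live anyway.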
Non-stretch dataset-version resizes (letterbox / center-crop / fit / letterbox-reflect) must be replayed via the PIL two-step path, which also sets nonsquare_intermediate_size that post-processing uses to map boxes back. The tensor path does a single stretch and sets neither, so gate it on not _needs_two_step_resize(); keep _pre_process_numpy for those models.
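The gating rule in this commit message can be sketched as a small dispatch. This is an illustration under assumptions: the `ResizeMode` enum, `needs_two_step_resize` and `select_numpy_path` are hypothetical stand-ins, and the real `_needs_two_step_resize()` signature is not shown in this PR page.

```python
from enum import Enum


class ResizeMode(Enum):
    """Hypothetical labels for the resize modes named in the commit message."""
    STRETCH = "stretch"
    LETTERBOX = "letterbox"
    CENTER_CROP = "center-crop"
    FIT = "fit"
    LETTERBOX_REFLECT = "letterbox-reflect"


def needs_two_step_resize(mode: ResizeMode) -> bool:
    # Stand-in for _needs_two_step_resize(): every non-stretch mode must be
    # replayed through the PIL two-step path.
    return mode is not ResizeMode.STRETCH


def select_numpy_path(mode: ResizeMode) -> str:
    # The tensor path performs a single stretch and does not record the
    # intermediate size that post-processing needs, so it is only chosen
    # for stretch-mode models.
    return "pil_two_step" if needs_two_step_resize(mode) else "gpu_tensor"


if __name__ == "__main__":
    for mode in ResizeMode:
        print(f"{mode.value}: {select_numpy_path(mode)}")
```

Keeping the decision in one predicate means a model with a letterbox or crop resize silently falls back to the slower but box-correct path instead of producing misplaced detections.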
PawelPeczek-Roboflow
requested changes
Jun 19, 2026
PawelPeczek-Roboflow
left a comment
Collaborator
The change violates test assertions.
Contributor
Author
Thanks - note that it's still in draft. It's giving a massive speedup for a simple rfdetr-nano. Will continue to evaluate.
Collaborator
yeah - sure, wanted to merge it but noticed test issues
What
Routes the RF-DETR numpy / uint8 preprocessing branch through the existing GPU tensor path: in pre_process_network_input, the numpy frame is uploaded to the target device as a float CHW tensor in [0, 1] and handed to _pre_process_tensor (on-device channel-swap + F.resize(antialias) + normalize), instead of the CPU PIL chain (numpy BGR→RGB reverse-copy + Pillow resize).

On-device measurement (AI1 / Orin NX, rfdetr): pre_process ~30 → ~6 ms/frame.

Why
On the inference-server path, the model is always fed numpy (the adapter decodes every request to BGR np.ndarray before calling pre_process), so today every served frame takes the PIL branch, which is the dominant per-frame cost on edge/Jetson. The float-tensor branch already exists and is exercised by predict() / direct tensor callers; this points the numpy branch at the same code.

Notes / caveats
- Tensor F.resize is not byte-identical to PIL F.resize. This makes numpy serving match predict()'s tensor-path numerics rather than the old PIL numerics. Recommend an mAP check vs the PIL baseline before treating this as the default for all servers.
- _pre_process_numpy is left in place (now unused by this path).
- pre_processing.py (+17/−2).

Test plan
- pre_process latency drop confirmed, detections unchanged on a stretch-mode rfdetr model.
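A latency check like the one in the test plan could be reproduced with a small harness along these lines. This is a generic sketch, not the profiling setup used for the numbers above; `median_latency_ms` and its parameters are hypothetical. When timing a CUDA path, the callable should synchronize the device (for example with `torch.cuda.synchronize()`) before returning, because kernel launches are asynchronous and would otherwise be under-counted.

```python
import statistics
import time


def median_latency_ms(fn, frame, warmup=5, iters=50):
    """Hypothetical helper: median wall-clock time of fn(frame) in milliseconds."""
    # Warm-up calls absorb one-off costs such as lazy initialization and caches.
    for _ in range(warmup):
        fn(frame)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(frame)
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to scheduler noise than the mean.
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder workload; swap in the old and new pre_process callables
    # with the same input frame to compare them.
    def fake_pre_process(frame):
        return sum(frame)

    print(f"{median_latency_ms(fake_pre_process, list(range(10000))):.3f} ms")
```

Running both preprocessing paths through the same harness on the same frame keeps the comparison like-for-like.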