llama : synchronize context before backend teardown by krystophny · Pull Request #24935 · ggml-org/llama.cpp

krystophny · 2026-06-23T07:54:55Z

Overview

~llama_context() releases backend buffers without first draining
outstanding GPU work. On a multi-GPU CUDA build this aborts during
teardown: cudaFree reports unspecified launch failure inside
~ggml_backend_cuda_buffer_context. Calling synchronize() at the
start of the destructor drains pending work before any buffer is freed.
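
The ordering described above can be illustrated with a minimal, self-contained sketch. It does not use the real llama.cpp or ggml types: `mock_backend`, `mock_context`, and `unsafe_frees` are hypothetical names invented for this example, and the "pending work" counter only stands in for asynchronous GPU work. It shows only that draining the queue before the first buffer is released avoids frees that overlap with in-flight work.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a GPU backend: work is queued asynchronously
// and should be drained before any buffer it touches is released.
struct mock_backend {
    int pending          = 0; // work items queued but not yet finished
    int frees_while_busy = 0; // buffers released while work was still pending

    void launch()      { ++pending; }
    void synchronize() { pending = 0; } // models "block until the queue is empty"
    void free_buffer() { if (pending > 0) { ++frees_while_busy; } }
};

// Hypothetical context that owns n_buffers buffers on the backend.
class mock_context {
public:
    mock_context(mock_backend & be, std::size_t n_buffers, bool sync_on_destroy)
        : be_(be), n_buffers_(n_buffers), sync_on_destroy_(sync_on_destroy) {
        assert(n_buffers > 0);
    }

    void decode() { be_.launch(); } // enqueue work, return immediately

    ~mock_context() {
        if (sync_on_destroy_) {
            be_.synchronize(); // drain pending work first ...
        }
        for (std::size_t i = 0; i < n_buffers_; ++i) {
            be_.free_buffer(); // ... then release the buffers
        }
    }

private:
    mock_backend & be_;
    std::size_t    n_buffers_;
    bool           sync_on_destroy_;
};

// Returns how many buffers were freed while work was still in flight.
inline int unsafe_frees(bool sync_on_destroy) {
    mock_backend be;
    {
        mock_context ctx(be, 3, sync_on_destroy);
        ctx.decode();
    } // destructor runs here
    return be.frees_while_busy;
}
```

In this model, `unsafe_frees(false)` counts every buffer as freed under pending work, while `unsafe_frees(true)` counts none. The real failure mode is a CUDA error rather than a counter, so treat this purely as a picture of the ordering.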

Additional information

Reproduced on clean master (dec5ca557), CUDA build, two RTX 5060 Ti
(sm_120), CUDA 13.3.

Before:

2/2 Test #36: test-thread-safety ...Subprocess aborted
E CUDA error: unspecified launch failure
E   current device: 1, in function ~ggml_backend_cuda_buffer_context
      at ggml/src/ggml-cuda/ggml-cuda.cu:637
E   cudaFree(dev_ptr)

After:

2/2 Test #36: test-thread-safety ...Passed
100% tests passed, 0 tests failed out of 2

The abort surfaces on Blackwell here. The change is a teardown-ordering
fix and does not depend on the GPU.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. The two-line teardown fix originates in my
earlier work; AI assisted with bisecting the failure, building, and
running test-thread-safety. I reviewed every line, own the change,
and can explain it.

The synchronize-on-teardown fix (src/llama-context.cpp) and the phi3 meta split-state fix (src/llama-model.cpp) entered this branch via an upstream merge and are unrelated to the Responses API work. They are now filed independently as ggml-org#24935 and ggml-org#24936. Remove them here so this PR is limited to the server Responses changes.

ggml-gh-bot · 2026-06-23T07:59:26Z

Hi @krystophny, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

ggerganov · 2026-06-23T08:07:24Z

It would be better to synchronize the contexts explicitly at the end of test-thread-safety. Something like:

diff --git a/tests/test-thread-safety.cpp b/tests/test-thread-safety.cpp
index acda4aa81..d0b5946e2 100644
--- a/tests/test-thread-safety.cpp
+++ b/tests/test-thread-safety.cpp
@@ -146,6 +146,8 @@ int main(int argc, char ** argv) {
                 }
 
                 LOG_INF("Model %d/%d, Context %d/%d: %s\n\n", m + 1, num_models, c + 1, num_contexts, result.c_str());
+
+                llama_synchronize(ctx.get());
             });
         }
     }

Assisted-by: Claude
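
The suggestion above moves the synchronization to the caller: each worker thread drains its own context before the owning smart pointer goes out of scope. A rough sketch of that pattern follows, using mock types rather than the llama.cpp API. `mock_ctx`, `mock_decode`, `mock_synchronize`, `mock_ctx_deleter`, and `run_workers` are all hypothetical names, and the atomic counter only approximates "work still in flight at destruction time".

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical context with a counter standing in for queued GPU work.
struct mock_ctx {
    std::atomic<int> pending{0};
};

// Number of contexts destroyed while they still had pending work.
static std::atomic<int> g_busy_destroys{0};

inline void mock_decode(mock_ctx * ctx)      { ctx->pending.fetch_add(1); }
inline void mock_synchronize(mock_ctx * ctx) { ctx->pending.store(0); }

// Deleter that records whether the context was destroyed while busy.
struct mock_ctx_deleter {
    void operator()(mock_ctx * ctx) const {
        if (ctx->pending.load() > 0) {
            g_busy_destroys.fetch_add(1);
        }
        delete ctx;
    }
};
using mock_ctx_ptr = std::unique_ptr<mock_ctx, mock_ctx_deleter>;

// Runs n_threads workers; each creates a context, queues work, and
// optionally synchronizes before the context is destroyed at scope exit.
inline int run_workers(int n_threads, bool sync_at_end) {
    assert(n_threads > 0);
    g_busy_destroys.store(0);

    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([sync_at_end]() {
            mock_ctx_ptr ctx(new mock_ctx());
            mock_decode(ctx.get());
            if (sync_at_end) {
                mock_synchronize(ctx.get()); // last statement in the worker
            }
        }); // ctx is destroyed here, inside the worker thread
    }
    for (auto & th : threads) {
        th.join();
    }
    return g_busy_destroys.load();
}
```

With `sync_at_end` set, no context reaches its deleter with work outstanding. The trade-off against a destructor-side fix is that every caller has to remember the explicit call.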

krystophny · 2026-06-23T08:27:39Z

thx for the fast look, changed it!

krystophny requested a review from ggerganov as a code owner June 23, 2026 07:54

tests : synchronize contexts at end of test-thread-safety

e654df6

Assisted-by: Claude

krystophny force-pushed the fix/cuda-ctx-synchronize-on-destroy branch from be1fd16 to e654df6 Compare June 23, 2026 08:21

github-actions Bot added the "testing" label (Everything test related) Jun 23, 2026

krystophny mentioned this pull request Jun 23, 2026, in: server: improve Responses API compliance and Codex CLI compatibility #21174 (Open, 9 tasks)

krystophny wants to merge 1 commit into ggml-org:master from krystophny:fix/cuda-ctx-synchronize-on-destroy

2 participants
