sycl : clamp softmax input to avoid underflow by Jassieluo · Pull Request #24941 · ggml-org/llama.cpp

Jassieluo · 2026-06-23T11:18:11Z

Overview

This PR fixes a numerical stability bug in the SYCL softmax kernel where all-masked inputs ( -INFINITY ) could lead to NaN propagation.

When all items in a row are masked out with -INFINITY , the normalized input vals[col] - max_val evaluates to (-inf) - (-inf) = NaN . Without clamping, sycl::native::exp(NaN) returns NaN , causing the entire row of softmax output to collapse to NaN.

To resolve this, we clamp the exponent input using sycl::max(..., -80.0f) . -80.0f is chosen because it is safely above the single-precision float normalized limit ln (FLT_MIN) ≈ -87.33f. This avoids underflow to exactly 0.0f (which would still cause division-by-zero NaN under all-masked scenarios) and prevents subnormal floating-point operations (denormals) on the GPU, avoiding severe execution speed penalties.

Additional information

Here are the test results comparing the output of the SYCL softmax kernel with and without this fix on an Intel GPU (Intel(R) Iris(R) Xe Graphics):

Input: [-inf, -inf, -inf] (All Masked)

Without fix (Original): [-nan, -nan, -nan] (Failed/NaN Collapse)
With fix (This PR): [0.333333, 0.333333, 0.333333] (Correct uniform probability)

Input: [1.0f, 2.0f, -1e9f] (Normal Masking)

Without fix (Original): [0.268941, 0.731059, 0.0f]

With fix (This PR): [0.268941, 0.731059, 1.31945e-35] (No change in precision for normal values)

Verification Code:

template <bool clamp, typename T>
void run_official_softmax_kernel(sycl::queue& q, const float* d_in, const T* d_mask, float* d_out, int ncols) {
    soft_max_params params = {};
    params.ncols = ncols;
    params.scale = 1.0f;
    params.max_bias = 0.0f;
    params.ne12 = 1;
    params.ne13 = 1;
    params.nb11 = sizeof(T);
    params.nb12 = sizeof(T);
    params.nb13 = sizeof(T);

    int nth = 32;
    const dpct::dim3 block_dims(nth, 1, 1);
    const dpct::dim3 block_nums(1, 1, 1);
    const size_t nbytes_shared = (32 + 32) * sizeof(float);

    q.submit([&](sycl::handler &cgh) {
        sycl::local_accessor<uint8_t, 1> dpct_local_acc_ct1(
            sycl::range<1>(nbytes_shared), cgh);

        cgh.parallel_for(
            sycl::nd_range<3>(block_nums * block_dims, block_dims),
            [=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(32)]] {
                if constexpr (clamp) {
                    soft_max_f32_clamp<true, 0, 0, T>(
                        d_in, d_mask, nullptr, d_out, params,
                        dpct_local_acc_ct1
                            .get_multi_ptr<sycl::access::decorated::no>()
                            .get());
                } else {
                    soft_max_f32_no_clamp<true, 0, 0, T>(
                        d_in, d_mask, nullptr, d_out, params,
                        dpct_local_acc_ct1
                            .get_multi_ptr<sycl::access::decorated::no>()
                            .get());
                }
            });
    }).wait();
}
int main() {
    sycl::queue q;

    int ncols = 3;
    float input_mask_large[] = { 1.0f, 2.0f, -1e9f };
    float input_all_inf[] = { -INFINITY, -INFINITY, -INFINITY };

    float* d_in = sycl::malloc_shared<float>(ncols, q);
    float* d_out = sycl::malloc_shared<float>(ncols, q);

    // Test 1: All Infinities without clamp
    std::copy(input_all_inf, input_all_inf + ncols, d_in);
    run_official_softmax_kernel<false, float>(q, d_in, nullptr, d_out, ncols);
    std::cout << "All -inf (Without clamp): [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl;

    // Test 2: All Infinities with clamp
    run_official_softmax_kernel<true, float>(q, d_in, nullptr, d_out, ncols);
    std::cout << "All -inf (With clamp):    [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl << std::endl;

    // Test 3: Standard Input with Large Mask without clamp
    std::copy(input_mask_large, input_mask_large + ncols, d_in);
    run_official_softmax_kernel<false, float>(q, d_in, nullptr, d_out, ncols);
    std::cout << "Mask -1e9 (Without clamp): [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl;

    // Test 4: Standard Input with Large Mask with clamp
    run_official_softmax_kernel<true, float>(q, d_in, nullptr, d_out, ncols);
    std::cout << "Mask -1e9 (With clamp):    [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl;

    sycl::free(d_in, q);
    sycl::free(d_out, q);
    return 0;
}

Requirements

[✓] I have read and agree with the contributing guidelines https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
• AI usage disclosure: YES. AI was used to assist in porting a BERT model to SYCL. When NaN outputs were observed during testing, I personally debugged the codebase and identified the root cause in the softmax kernel. AI was subsequently used to write the standalone verification test script and refine/translate this PR description.

Note on Cross-Backend & FP16 Behavior

To verify whether this is unique to SYCL, we conducted standalone tests on CPU (AVX2) and CUDA (NVIDIA RTX 4060 Laptop) backends with the following findings:
1. IEEE 754 Standard Behavior: The (-inf) - (-inf) = NaN propagation is a universal math behavior. On CPU and CUDA, feeding all -INFINITY inputs to their respective softmax kernels without clamping similarly yields NaN.
2. FP16 Mask Overflow:
- Under FP32, a mask of -1e9f works correctly without clamping because it stays within representation bounds ((-1e9) - (-1e9) = 0).
- Under FP16 (half-precision), any mask value like -1e9f overflows the minimum limit of -65504 and gets rounded to -INFINITY. Consequently, any fully-masked padding row (common in BERT/embeddings) turns into all -INFINITY inputs, riggering the NaN collapse across all backends.

arthw

It's good job!

Thank you!

sycl : clamp softmax input to avoid underflow

9ede198

Jassieluo requested a review from a team as a code owner June 23, 2026 11:18

github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Jun 23, 2026

arthw approved these changes Jun 23, 2026

View reviewed changes

arthw added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sycl : clamp softmax input to avoid underflow#24941

sycl : clamp softmax input to avoid underflow#24941
Jassieluo wants to merge 1 commit into
ggml-org:masterfrom
Jassieluo:sycl-softmax-underflow

Jassieluo commented Jun 23, 2026 •

edited

Loading

Uh oh!

arthw left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Jassieluo commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Verification Code:

Requirements

Note on Cross-Backend & FP16 Behavior

Uh oh!

arthw left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Jassieluo commented Jun 23, 2026 •

edited

Loading