Skip to content

sycl : clamp softmax input to avoid underflow#24941

Open
Jassieluo wants to merge 1 commit into
ggml-org:masterfrom
Jassieluo:sycl-softmax-underflow
Open

sycl : clamp softmax input to avoid underflow#24941
Jassieluo wants to merge 1 commit into
ggml-org:masterfrom
Jassieluo:sycl-softmax-underflow

Conversation

@Jassieluo

@Jassieluo Jassieluo commented Jun 23, 2026

Copy link
Copy Markdown

Overview

This PR fixes a numerical stability bug in the SYCL softmax kernel where all-masked inputs ( -INFINITY ) could lead to NaN propagation.

When all items in a row are masked out with -INFINITY , the normalized input vals[col] - max_val evaluates to (-inf) - (-inf) = NaN . Without clamping, sycl::native::exp(NaN) returns NaN , causing the entire row of softmax output to collapse to NaN.

To resolve this, we clamp the exponent input using sycl::max(..., -80.0f) . -80.0f is chosen because it is safely above the single-precision float normalized limit ln (FLT_MIN) ≈ -87.33f. This avoids underflow to exactly 0.0f (which would still cause division-by-zero NaN under all-masked scenarios) and prevents subnormal floating-point operations (denormals) on the GPU, avoiding severe execution speed penalties.

Additional information

Here are the test results comparing the output of the SYCL softmax kernel with and without this fix on an Intel GPU (Intel(R) Iris(R) Xe Graphics):

Input: [-inf, -inf, -inf] (All Masked)

  • Without fix (Original): [-nan, -nan, -nan] (Failed/NaN Collapse)
  • With fix (This PR): [0.333333, 0.333333, 0.333333] (Correct uniform probability)

Input: [1.0f, 2.0f, -1e9f] (Normal Masking)

  • Without fix (Original): [0.268941, 0.731059, 0.0f]

  • With fix (This PR): [0.268941, 0.731059, 1.31945e-35] (No change in precision for normal values)

    Verification Code:

    template <bool clamp, typename T>
    void run_official_softmax_kernel(sycl::queue& q, const float* d_in, const T* d_mask, float* d_out, int ncols) {
        soft_max_params params = {};
        params.ncols = ncols;
        params.scale = 1.0f;
        params.max_bias = 0.0f;
        params.ne12 = 1;
        params.ne13 = 1;
        params.nb11 = sizeof(T);
        params.nb12 = sizeof(T);
        params.nb13 = sizeof(T);
    
        int nth = 32;
        const dpct::dim3 block_dims(nth, 1, 1);
        const dpct::dim3 block_nums(1, 1, 1);
        const size_t nbytes_shared = (32 + 32) * sizeof(float);
    
        q.submit([&](sycl::handler &cgh) {
            sycl::local_accessor<uint8_t, 1> dpct_local_acc_ct1(
                sycl::range<1>(nbytes_shared), cgh);
    
            cgh.parallel_for(
                sycl::nd_range<3>(block_nums * block_dims, block_dims),
                [=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(32)]] {
                    if constexpr (clamp) {
                        soft_max_f32_clamp<true, 0, 0, T>(
                            d_in, d_mask, nullptr, d_out, params,
                            dpct_local_acc_ct1
                                .get_multi_ptr<sycl::access::decorated::no>()
                                .get());
                    } else {
                        soft_max_f32_no_clamp<true, 0, 0, T>(
                            d_in, d_mask, nullptr, d_out, params,
                            dpct_local_acc_ct1
                                .get_multi_ptr<sycl::access::decorated::no>()
                                .get());
                    }
                });
        }).wait();
    }
    int main() {
        sycl::queue q;
    
        int ncols = 3;
        float input_mask_large[] = { 1.0f, 2.0f, -1e9f };
        float input_all_inf[] = { -INFINITY, -INFINITY, -INFINITY };
    
        float* d_in = sycl::malloc_shared<float>(ncols, q);
        float* d_out = sycl::malloc_shared<float>(ncols, q);
    
        // Test 1: All Infinities without clamp
        std::copy(input_all_inf, input_all_inf + ncols, d_in);
        run_official_softmax_kernel<false, float>(q, d_in, nullptr, d_out, ncols);
        std::cout << "All -inf (Without clamp): [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl;
    
        // Test 2: All Infinities with clamp
        run_official_softmax_kernel<true, float>(q, d_in, nullptr, d_out, ncols);
        std::cout << "All -inf (With clamp):    [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl << std::endl;
    
        // Test 3: Standard Input with Large Mask without clamp
        std::copy(input_mask_large, input_mask_large + ncols, d_in);
        run_official_softmax_kernel<false, float>(q, d_in, nullptr, d_out, ncols);
        std::cout << "Mask -1e9 (Without clamp): [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl;
    
        // Test 4: Standard Input with Large Mask with clamp
        run_official_softmax_kernel<true, float>(q, d_in, nullptr, d_out, ncols);
        std::cout << "Mask -1e9 (With clamp):    [ " << d_out[0] << " " << d_out[1] << " " << d_out[2] << " ]" << std::endl;
    
        sycl::free(d_in, q);
        sycl::free(d_out, q);
        return 0;
    }
    

Requirements

[✓] I have read and agree with the contributing guidelines https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
• AI usage disclosure: YES. AI was used to assist in porting a BERT model to SYCL. When NaN outputs were observed during testing, I personally debugged the codebase and identified the root cause in the softmax kernel. AI was subsequently used to write the standalone verification test script and refine/translate this PR description.

Note on Cross-Backend & FP16 Behavior

To verify whether this is unique to SYCL, we conducted standalone tests on CPU (AVX2) and CUDA (NVIDIA RTX 4060 Laptop) backends with the following findings:
1. IEEE 754 Standard Behavior: The (-inf) - (-inf) = NaN propagation is a universal math behavior. On CPU and CUDA, feeding all -INFINITY inputs to their respective softmax kernels without clamping similarly yields NaN.
2. FP16 Mask Overflow:
- Under FP32, a mask of -1e9f works correctly without clamping because it stays within representation bounds ((-1e9) - (-1e9) = 0).
- Under FP16 (half-precision), any mask value like -1e9f overflows the minimum limit of -65504 and gets rounded to -INFINITY. Consequently, any fully-masked padding row (common in BERT/embeddings) turns into all -INFINITY inputs, riggering the NaN collapse across all backends.

@Jassieluo Jassieluo requested a review from a team as a code owner June 23, 2026 11:18
@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Jun 23, 2026

@arthw arthw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good job!

Thank you!

@arthw arthw added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants