sycl : clamp softmax input to avoid underflow #24941
Open
Jassieluo wants to merge 1 commit into
Overview
This PR fixes a numerical stability bug in the SYCL softmax kernel where all-masked inputs (-INFINITY) could lead to NaN propagation.
When all items in a row are masked out with -INFINITY, the row maximum max_val is also -INFINITY, so the normalized input vals[col] - max_val evaluates to (-inf) - (-inf) = NaN. Without clamping, sycl::native::exp(NaN) returns NaN, causing the entire row of softmax output to collapse to NaN.
To resolve this, we clamp the exponent input using sycl::max(..., -80.0f). The value -80.0f is chosen because it is safely above the single-precision normalized limit ln(FLT_MIN) ≈ -87.34. With the clamp, the exponential never underflows to exactly 0.0f (which would still cause a division-by-zero NaN in the all-masked case), and the results stay out of the subnormal (denormal) range, which avoids severe execution speed penalties on the GPU.
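The effect of the clamp can be sketched on the CPU in plain C++. This is a minimal illustration, not the kernel code from this PR: softmax_row is a hypothetical helper, std::exp stands in for sycl::native::exp, and std::fmax (which returns the non-NaN operand) stands in for sycl::max. Whether sycl::max handles a NaN operand the same way on a given device is an assumption here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU stand-in for one row of the softmax kernel.
// clamp == false models the original behavior, clamp == true models the fix.
std::vector<float> softmax_row(const std::vector<float>& vals, bool clamp) {
    assert(!vals.empty());

    // Row maximum; stays -inf when every element is -inf.
    float max_val = -std::numeric_limits<float>::infinity();
    for (float v : vals) {
        max_val = std::fmax(max_val, v);
    }

    std::vector<float> out(vals.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < vals.size(); ++i) {
        // For a fully masked row this is (-inf) - (-inf) = NaN.
        float x = vals[i] - max_val;
        if (clamp) {
            // std::fmax returns the non-NaN operand, so NaN becomes -80.0f
            // (assumed to mirror the sycl::max clamp in the PR).
            x = std::fmax(x, -80.0f);
        }
        out[i] = std::exp(x);
        sum += out[i];
    }

    // Normalize; with the clamp, sum is at least exp(-80) > 0.
    for (float& o : out) {
        o /= sum;
    }
    return out;
}
```

Calling softmax_row with a row of three -INFINITY values and clamp set to true yields a uniform row, because every element contributes exp(-80) to both numerator and denominator; with clamp set to false the same row is all NaN.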
Additional information
Here are the test results comparing the output of the SYCL softmax kernel with and without this fix on an Intel GPU (Intel(R) Iris(R) Xe Graphics):
Input:
[-inf, -inf, -inf] (All Masked)
Without fix (Original):
[-nan, -nan, -nan] (Failed/NaN Collapse)
With fix (This PR):
[0.333333, 0.333333, 0.333333] (Correct uniform probability)
Input:
[1.0f, 2.0f, -1e9f] (Normal Masking)
Without fix (Original):
[0.268941, 0.731059, 0.0f]
With fix (This PR):
[0.268941, 0.731059, 1.31945e-35] (No change in precision for normal values)
Verification Code:
Requirements
[✓] I have read and agree with the contributing guidelines https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
• AI usage disclosure: YES. AI was used to assist in porting a BERT model to SYCL. When NaN outputs were observed during testing, I personally debugged the codebase and identified the root cause in the softmax kernel. AI was subsequently used to write the standalone verification test script and refine/translate this PR description.
Note on Cross-Backend & FP16 Behavior
To verify whether this is unique to SYCL, we conducted standalone tests on CPU (AVX2) and CUDA (NVIDIA RTX 4060 Laptop) backends with the following findings:
1. IEEE 754 Standard Behavior: The (-inf) - (-inf) = NaN propagation is a universal math behavior. On CPU and CUDA, feeding all -INFINITY inputs to their respective softmax kernels without clamping similarly yields NaN.
2. FP16 Mask Overflow:
- Under FP32, a mask of -1e9f works correctly without clamping because it stays within representation bounds ((-1e9) - (-1e9) = 0).
- Under FP16 (half-precision), any mask value like -1e9f overflows the minimum limit of -65504 and gets rounded to -INFINITY. Consequently, any fully-masked padding row (common in BERT/embeddings) turns into all -INFINITY inputs, triggering the NaN collapse across all backends.
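The FP16 overflow path above can be illustrated with a small C++ sketch. Standard C++ has no portable half type, so fp16_overflow_only is a hypothetical helper that emulates only the overflow part of a float to FP16 conversion under round-to-nearest-even (magnitudes of 65520 or more, the midpoint between 65504 and 2^16, become infinite). Mantissa rounding is deliberately ignored, so this is a simplification rather than a full conversion.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: models only the overflow of a float -> FP16 conversion.
// Values with |x| >= 65520 round to +/-inf; everything else passes through
// unchanged (mantissa rounding is not modeled).
float fp16_overflow_only(float x) {
    const float half_overflow = 65520.0f;  // midpoint between 65504 and 2^16
    if (std::fabs(x) >= half_overflow) {
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    }
    return x;
}

// Exponent input (value minus row max) for a row in which every element
// holds the same mask value, optionally passed through the FP16 model first.
float shifted_input_fully_masked(float mask, bool fp16_mask) {
    assert(mask < 0.0f);  // masks are large negative values
    const float m = fp16_mask ? fp16_overflow_only(mask) : mask;
    const float max_val = m;  // all elements equal, so the row max is m
    return m - max_val;       // 0 for a finite mask, NaN for -inf
}
```

With mask = -1e9f, the FP32 path returns 0, while the FP16 path first turns the mask into -INFINITY and then returns NaN, which is the input the clamp is meant to absorb.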