Arm64: Improve support for HW_Flag_ReturnsPerElementMask#128326
Arm64: Improve support for HW_Flag_ReturnsPerElementMask#128326snickolls-arm wants to merge 8 commits into
Conversation
When wrapping an intrinsic node that has an embedded mask with a ConditionalSelect, ensure that the constant node in op3 has a mask type when the intrinsic has the HW_Flag_ReturnsPerElementMask flag. Build out further support for ConditionalSelect_Predicates, and use this to wrap nodes with HW_Flag_ReturnsPerElementMask. Add GenTree::IsSelectZero and update various areas in HW intrinsic codegen to ensure this intrinsic assembles correctly. Use a tree visitor for assigning `TYP_MASK` to intrinsics that have `HW_Flag_ReturnsPerElementMask`. The current version of `impHWIntrinsic` does not process child nodes of the tree it returns for mask types, only the root node.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
| { | ||
| GenTreeHWIntrinsic* intrin = (*use)->AsHWIntrinsic(); | ||
|
|
||
| if (HWIntrinsicInfo::ReturnsPerElementMask(intrin->GetHWIntrinsicId()) && !intrin->TypeIs(TYP_MASK)) |
There was a problem hiding this comment.
This is not correct for all platforms and is going to regress xarch as well as be incorrect for hardware without TYP_MASK support. I imagine it may also regress optimization opportunities for AdvSimd on Arm64.
The consideration is that many node kinds return a per element mask and are explicitly not returning TYP_MASK. For example, Vector128.GreaterThan is an API which explicitly returns a vector, but where it is conceptually known to be a "per-element mask" (i.e. each element is either AllBitsSet or Zero).
Having this knowledge, even pre-SVE or pre-AVX512, where no dedicated TYP_MASK support exists, is beneficial as it allows unlocking a number of other optimization opportunities and folding operations that may not be otherwise valid. -- These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.
Rather, we only want to do such a transform if we have TYP_MASK support and its going to emit an instruction that actually produces a TYP_MASK, not that is just "conceptually" a mask. In the case of xarch we do so by marking the downlevel instructions as special-import and adjusting those as needed; there is explicitly no need to do traversals of the tree since we know that we are either producing a mask (and therefore need CvtMaskToVector) or we are expecting a mask (and therefore need CvtVectorToMask).
It's unclear then why Arm64 needs to do a tree traversal itself here as it should have the same general scenario. Any given intrinsic is one of three categories (does nothing with masks, produces a mask, or consumes a mask) and so it should be trivially handled without any consideration of tree traversal.
There was a problem hiding this comment.
-- The code here is only called from Arm64, but the ifdef doesn't cover that; nor does the summary or visitor make it clear its only valid for Arm64; so future readers or refactorings may miss the consideration.
But then it's very unclear to me why we need this setup and why it needs to deviate from what's already trivially working for other platforms with masking support.
There was a problem hiding this comment.
Sorry, I should've left a comment with more context.
I was having trouble with implementing the MaxMagnitude intrinsic, see this import code:
The intrinsics in this tree don't have mask types assigned and tend to cause assertions in Lowering when inserting implicit mask operands. I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64, I would add a pass based on HW_Flag_ReturnsPerElementMask to enforce that instead. This would allow you to write short algorithms in import with correct types per the CIL and have the visitor apply the types for a small runtime cost.
-- The code here is only called from Arm64, but the ifdef doesn't cover that; nor does the summary or visitor make it clear its only valid for Arm64; so future readers or refactorings may miss the consideration.
This is my mistake, it was only intended for Arm64 and I need to fix the ifdef. I will clarify the documentation too.
These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.
@a74nh has made a good amount of progress on this recently. We're building an abstraction of a constant in terms of {pattern, value} for this. For example, strength reduction of {any pattern, any value} & {repeated, 0} => {repeated, 0}. The current way things are done I think fits in this hierarchy, just where the pattern is a 'single scalar'.
I can see how SVE has overloaded the meaning of FEATURE_MASKED_HW_INTRINSICS and HW_Flag_ReturnsPerElementMask, really it's a different feature. I think we'll want to reconcile that in future.
There was a problem hiding this comment.
I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64
We notably handle this scenario on xarch by having the user visible API as HW_Flag_InvalidNodeId and having a different internal only intrinsic ID that is expected to have the mask. For example, we have NI_AVX512_CompareEqual which matches the managed API surface returning a Vector512<T> and then NI_AVX512_CompareEqualMask which is the internal API returning a TYP_MASK.
This helps ensure we never produce the vector returning ID in IR (as it triggers an assert when setting the ID here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L30507-L30531), which helps force users to think about the correct shape.
However, beyond that we also have the GetHWIntrinsicIdFor* and GetLookupTypeFor* helpers, which Arm64 isn't participating in right now (part of the deviation that probably shouldn't exist).
Note for example how gtNewSimdCmpOpNode calls GetLookupTypeForCmpOp which forces the type to TYP_MASK if AVX512 is supported, helping to canonicalize the support.
Then GetHWIntrinsicIdForCmpOp handles this and knows to return say NI_AVX512_CompareEqualMask instead of NI_X86Base_CompareEqual for a 128-bit comparison, guaranteeing the IR is correct since we then know to insert CvtMaskToVectorNode since the lookup type (mask) and actual type (simd) mismatch.
If Arm64 just participates in these existing helpers, then there's no need to have a custom visitor or different logic for SVE, it "just works" with all the existing support in the JIT. It also avoids issues if something like gtNewSimdMinMaxNode is used outside of import, which is very possible for many of the other helpers (especially due to morph or other phases doing optimizations).
There was a problem hiding this comment.
Generally speaking, I'd expect the minimum changes here be that Arm64 has GetLookupTypeFor* return TYP_MASK for any SVE intrinsic that is flagged HW_Flag_ReturnsPerElementMask and for it to actually return NI_Sve_* intrinsics from GetHWIntrinsicIdFor* when the simd size is "unknown".
This will cause almost all the existing helpers to light up and participate in all the general optimizations and to be correct by construction, rather than relying on developers to manually ensure any given instantiation is valid.
There was a problem hiding this comment.
Thanks for the clarification, so nothing needs to be done there then. Just need to be careful with the implementation of Vector.ConditionalSelect.
One last point, is TYP_MASK supposed to be semantically bit-wise or element-wise? Currently on Arm64 it is always an element-wise representation as it is backed by a predicate register, and this could be another source of difference between architectures/ISAs?
There was a problem hiding this comment.
TYP_MASK is expected to be elementwise, that is it uses a single bit (in some architecture specific layout) to indicate "true" or "false" for a given element.
There's then a subtle edge case here that differs based on how a given architectures does its bit layout. xarch for example simply uses the lowest count bits, so a Vector128<int> and a Vector256<long> are both represented using 0bWZYX, as each requires 4 bits. arm64 on the other hand reads every nth bit, where n is based on sizeof(T) and so it would use 0bY000_X000 for Vector64<int> but 0bY000_0000_X000_0000 for Vector128<long>. -- noting this was pseudo-syntax since v64/128 don't actually support predication, but if the SVE scalable size matched those, then that is how it would be used.
What this means is that the ways in which TYP_MASK can be used across distinct operations may differ slightly and the places it may need to be widened or narrowed can vary -- i.e. if you had a predicate produced by Vector.GreaterThan<int> and then wanted to use it alongside some Vector.Add<long> operation, then the way you achieve that or whether it is implicitly safe varies between them.
There was a problem hiding this comment.
arm64 on the other hand reads every nth bit, where n is based on sizeof(T) and so it would use 0bY000_X000 for Vector64 but 0bY000_0000_X000_0000 for Vector128.
This representation is what we assume for conversions between masks and vectors at runtime, see here:
runtime/src/coreclr/jit/hwintrinsiccodegenarm64.cpp
Lines 2119 to 2129 in ac88c72
This codegen doesn't seem consistent with the way the optimizations handle element-wise masks. Element-wise masks are converted to bit-wise masks and 'AllBitsSet' is treated as an active lane. Our conversion codegen is producing the integer '1' instead.
In the other direction converting from bit-wise form to element-wise form, we accept any value that is not equal to zero as a sign that the element is active. That means if you take some bitwise-wise mask and perform any function on it that results in a value other than zero, the lane will still be active when converted back into element-wise form.
I'm concerned this could cause inconsistencies between optimized code and unoptimized code? For example, due to the way I have changed CreateTrueMask* to import a mask constant instead of a vector constant in this PR, it seems to be emitting an unoptimized SVE sequence now rather than the optimized NEON sequence:
- mvni v16.4s, #0
+ ptrue p0.b
+ mov z16.b, p0/z, #1
The old sequence will create a vector of -1, however the new sequence will create a vector of 1 because of the conversion introduced.
IIUC the contract you've used so far for other architectures is:
bit(i) = 1 (element-wise) <==> element(i) = AllBitsSet (bit-wise)
and I think we need to make sure we adhere to this for SVE, so that we can maintain consistency with the general optimizations.
There was a problem hiding this comment.
We definitely need to ensure that the behavior is consistent between optimized and unoptimized code, it would be a bug otherwise. -- it is certainly possible there are existing latent bugs due to platform differences or someone misunderstanding the requirements when adding code.
ConvertMaskToVector and ConvertVectorToMask are internal helpers because we don't have a managed predicate type and so no signatures directly take predicates. Since the xplat APIs are all documented in terms of vectors, not predicates, there's really no consideration or IR transforms required. Its rather only when we start producing platform specific IR that we need to insert these, because some instructions actually return or take predicates.
Given that, we require ConvertMaskToVector do a conversion such that the vector is AllBitsSet (active, true) or Zero (inactive, false) on a per-element basis. This is required, for example, so that Vector.GreaterThan can be transformed into Sve.CompareGreaterThan and it still functions as intended.
ConvertVectorToMask, on the other hand, is a little bit different. For starters we don't have any xplat APIs that actually take masks, rather the closest thing today is Vector.ConditionalSelect which is bitwise instead. Because of this it largely only comes into play for platform specific APIs, such as Sve.Compact, where one or more of the operands is required to be a predicate. -- If we ever expose such APIs in the future, like say a Vector.MaskedLoad API, then we'd need to decide on the behavior at that time. But its a bridge to cross when we get to it.
This also means we do not have as stringent of a behavior for ConvertVectorToMask and it can do something different between architectures. In general we want to do it however is deemed most optimal, and can differ between architectures (xarch vs arm64), but it must at least handle 0 -> inactive and AllBitsSet -> active. For xarch, as an example, we use the vpmov*2m instructions which check if the most significant bit is set, and so an API like Avx512F.BlendVariable(Vector128<int> trueVal, Vector128<int> falseVal, Vector128<int> mask) will select the element from trueVal for any (mask[i] & 0x8000_0000) != 0. This is fine since it works as expected when a given element is AllBitsSet (most significant bit is set) or Zero. Users may be surprised that it won't work if they happen to have the mask element be something like 0x7FFF_FFFF, but that's okay because they explicitly used the platform specific API and we say it has that behavior.
For Arm64, we then have to make a similar decision on how to do the conversion. The current implementation is effectively doing mask[i] != 0, which is fine since it covers the scenarios required. If we want to use PMOV then we need to ensure that the behavior is then functionally identical to mask[i] != 0 and otherwise we would need to update the non-PMOV fallback mirror PMOV prior to shipping Sve as stable.
Given the above, it is then okay if ConvertVectorToMask(ConvertMaskToVector(x)) or ConvertMaskToVector(ConvertVectorToMask(x)) does not produce exactly x since that should only be encountered in places where we already knew that the IR could consume such a shape. Somewhere requiring x be bitwise preserved would be representative of an illegal JIT optimization.
There was a problem hiding this comment.
We definitely need to ensure that the behavior is consistent between optimized and unoptimized code, it would be a bug otherwise. -- it is certainly possible there are existing latent bugs due to platform differences or someone misunderstanding the requirements when adding code.
OK, in which case I believe the codegen for ConvertMaskToVector needs fixing to produce AllBitsSet, which I'll also look into.
If we want to use PMOV then we need to ensure that the behavior is then functionally identical to mask[i] != 0 and otherwise we would need to update the non-PMOV fallback mirror PMOV prior to shipping Sve as stable.
I believe PMOV will satisfy this, but unfortunately not the other conversion direction. If we did want to support PMOV for the stronger ConvertMaskVector, we'd need to add support for this representation in the optimization code. We may be better off using predicated MOV for the mask->vector direction but PMOV for the vector->mask direction.
Thanks again for the detailed answers, this thread has been useful for my understanding.
When wrapping an intrinsic node that has an embedded mask with a
ConditionalSelect, ensure that the constant node in op3 has a mask type when the intrinsic has theHW_Flag_ReturnsPerElementMaskflag.Build out further support for
ConditionalSelect_Predicates, and use this to wrap nodes withHW_Flag_ReturnsPerElementMask. AddGenTree::IsSelectZeroand update various areas in HW intrinsic codegen to ensure this intrinsic assembles correctly.