[fix](be) Avoid mutating shared Variant columns#64094
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Fixes a correctness issue in BE local-shuffle hashing for nullable columns where a logically-const CRC32C hash computation could mutate the nullable column’s nested data, potentially leaking the mutation to other consumers when the column instance is shared.
Changes:
- Update
ColumnNullable::update_crc32c_batch()to avoid in-place mutation by hashing a temporary cloned nested column only when nulls are present (preserving the no-null fast path). - Add a BE unit test to assert
update_crc32c_batch()does not mutate the original nested column values.
Checkpoint Conclusions (per core review guide)
- Goal & correctness: The change directly prevents mutation leakage by ensuring normalization happens on a temporary column; the new UT reproduces and guards the behavior.
- Concurrency / sharing safety: The updated code no longer mutates
_nested_columninside aconstmethod, removing the primary aliasing/thread-safety hazard described in the PR. - Performance considerations: No-null path remains fast. Null-present path now incurs a clone of the nested column; this is a deliberate tradeoff for correctness and isolation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| be/src/core/column/column_nullable.cpp | Stop mutating the stored nested column during CRC32C hashing; clone-and-normalize only when nulls are present. |
| be/test/core/column/column_nullable_test.cpp | Add UT asserting CRC32C batch hashing does not mutate the nullable nested column values. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
TPC-H: Total hot run time: 29019 ms |
TPC-DS: Total hot run time: 168466 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
597f79e to
7f17d77
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review result: no blocking issues found.
Critical checkpoint conclusions:
- Goal and tests: The PR addresses the stated issue by removing the in-place mutation from nullable CRC32C hashing while preserving default-value hashing for fixed-width nested columns. Added BE unit tests cover int64 and decimal nullable columns and verify nested data is not mutated.
- Scope and clarity: The change is small and focused on the CRC32C nullable hash path, with direct vector/decimal hot-path implementations and an isolated fallback.
- Concurrency and lifecycle: The change reduces shared-column mutation risk. No new shared mutable state, locks, threads, static initialization, or lifecycle ownership concerns were introduced.
- Configuration and compatibility: No new config, persistence, storage-format, protocol, or rolling-upgrade compatibility concern was introduced.
- Parallel paths: The direct CRC32C partitioning callers continue to use update_crc32c_batch. Non-CRC hash paths are unchanged.
- Conditional checks: The new has_null fast path and support_replace_column_null_data branch are consistent with existing nullable-column patterns.
- Test coverage: Focused unit tests were added. I attempted ./run-be-ut.sh --run --filter=
There was a problem hiding this comment.
Review result: no blocking issues found.
Critical checkpoint conclusions:
- Goal and tests: The PR addresses the stated issue by removing the in-place mutation from nullable CRC32C hashing while preserving default-value hashing for fixed-width nested columns. Added BE unit tests cover int64 and decimal nullable columns and verify nested data is not mutated.
- Scope and clarity: The change is small and focused on the CRC32C nullable hash path, with direct vector/decimal hot-path implementations and an isolated fallback.
- Concurrency and lifecycle: The change reduces shared-column mutation risk. No new shared mutable state, locks, threads, static initialization, or lifecycle ownership concerns were introduced.
- Configuration and compatibility: No new config, persistence, storage-format, protocol, or rolling-upgrade compatibility concern was introduced.
- Parallel paths: The direct CRC32C partitioning callers continue to use update_crc32c_batch. Non-CRC hash paths are unchanged.
- Conditional checks: The new has_null fast path and support_replace_column_null_data branch are consistent with existing nullable-column patterns.
- Test coverage: Focused unit tests were added. I attempted ./run-be-ut.sh --run --filter=ColumnNullableTest.UpdateCrc32cBatch*, but this runner could not complete BE UT setup because thirdparty/installed/bin/protoc was missing during gensrc generation. This occurred before compiling the touched code.
- Test result files: No regression output files are involved.
- Observability: No new runtime observability appears necessary for this localized hash correctness fix.
- Transactions and persistence: Not applicable.
- Data writes and modification: The key correctness improvement is avoiding mutation of nested nullable column data during logically const hashing.
- FE/BE variable passing: Not applicable.
- Performance: Vector and decimal hot paths avoid cloning and retain direct hashing. The no-null path avoids null-map processing. I did not identify a concrete performance regression requiring changes.
- User focus: No additional user-provided review focus was present.
I also reviewed existing inline review context and found no prior inline comments to account for.
TPC-H: Total hot run time: 29426 ms |
TPC-DS: Total hot run time: 169348 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
7f17d77 to
4e2cf50
Compare
|
run buildall |
4e2cf50 to
691c177
Compare
|
run buildall |
691c177 to
df1a913
Compare
|
run buildall |
|
/review |
df1a913 to
be03db5
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Reviewed the actual GitHub PR diff for PR #64094, which contains the Variant mutation fix and its BE unit tests. I did not find blocking issues.
Critical checkpoint conclusions:
- Goal and proof: The change prevents cast/serialization from finalizing shared
ColumnVariantinputs in place and fixes range copy over pending default suffixes. Added BE tests directly cover source non-mutation and suffix materialization. - Scope: The actual PR diff is focused on Variant code and related tests.
- Concurrency: The relevant concurrency concern is shared blocks from local exchange. The changed cast and serialization paths now finalize private clones before operations that require finalized columns, so they avoid mutating shared input columns.
- Lifecycle/static initialization: No new static/global lifecycle concerns found.
- Configuration: No config items added.
- Compatibility: Serialized Variant layout is unchanged; byte-size and serialize paths were updated consistently to use finalized private copies.
- Parallel code paths:
get_uncompressed_serialized_bytes,serialize, and string/JSONB cast paths were covered. I did not find another Variant finalization path in this diff that needs the same adjustment. - Conditional checks: The pending-default suffix insertion is bounded by the requested range and the added tests cover both partial suffix and suffix-only copies.
- Test coverage: Unit tests were added for
Subcolumn::insert_range_from, block/datatype serialization source non-mutation, clone-finalized deep copy, and Variant cast source non-mutation. I did not run tests in this review runner. - Test results: No
.outresult files are involved. - Observability: No new observability appears necessary for this internal correctness fix.
- Transactions/persistence/data writes: Not involved.
- FE-BE variables/protocol: Not involved.
- Performance: The extra clone happens only for unfinalized Variant inputs on paths that previously mutated the source; this is a reasonable correctness tradeoff for shared-block safety.
User focus points: No additional user-provided review focus was specified.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
be03db5 to
7df8a51
Compare
|
run buildall /review |
|
/review |
There was a problem hiding this comment.
I found one blocking correctness issue in the new Variant cast copy-on-write path. The change is focused and the tests cover several non-mutating/finalized cases, but the prefix execution edge below can crash or read out of bounds.
Critical checkpoint conclusions:
- Goal/test: The PR aims to avoid mutating shared Variant columns during serialization/casts, and it adds BE unit coverage for common serialization and cast paths. The zero-row prefix cast edge is not covered.
- Scope: The code changes are mostly focused, but the cast path now creates zero-row finalized Variant columns before calling helpers that assume non-empty input.
- Concurrency/COW: The main local-exchange shared-block hazard is addressed by private finalized copies, but the new copy/resize ordering introduces the issue below.
- Lifecycle/static/config: No new static lifecycle or configuration concerns found.
- Compatibility: No serialized layout/protocol compatibility changes found.
- Parallel paths: Serialization and cast paths were both considered; the issue is specific to cast prefix handling.
- Tests: Added tests are useful but miss
input_rows_count == 0for non-finalized Variant casts. - Observability/transactions/writes: Not applicable.
- Performance: The private clone-finalize approach can add work, but I did not find a blocking performance regression beyond the correctness issue.
User focus: No additional user-provided review focus was specified.
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Local exchange and join execution can share input blocks across downstream tasks. Variant cast and block serialization finalized ColumnVariant in place, and Subcolumn::insert_range_from could leave the lazy-default suffix unmaterialized when copying ranges. For local-shuffle anti-join queries that evaluate Variant path casts, one task could mutate a shared Variant column while another task is still reading it, leading to unstable results or range-copy failures. This change finalizes private deep copies for Variant cast/serialization paths, trims serialized Variant cast inputs to the requested input row prefix, and materializes pending defaults during range copy.
The cast path must also handle empty prefixes and legacy root-only unfinalized Variant columns. An empty prefix can otherwise create a zero-row ColumnVariant and then call helpers that assume num_rows > 0. Root-only unfinalized Variant test columns can also have a semantic input row count greater than ColumnVariant::size(), so checking the requested rows against ColumnVariant::size() can crash even though the root column contains the rows being cast.
The fix was reproduced with Variant red tests: the old code finalized source Variant columns during cast/serialization, failed prefix Variant-to-JSONB casts on private finalized copies, failed already-finalized prefix Variant-to-JSONB casts, crashed CastFromVariant on a root-only unfinalized Variant column, and failed to copy pending defaults. The same tests pass after the change. A local four-BE cluster also verified the affected local-shuffle anti-join query with Variant expressions and a non-Variant control query on the same plan shape.
### Release note
Fix an issue where local-shuffle queries using VARIANT expressions could return unstable results or fail.
### Check List (For Author)
- Test:
- BE Unit Test red/green: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant' reproduced the root-only unfinalized Variant crash before the CastFromVariant guard fix and passed after it
- BE Unit Test: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
- BE Unit Test: ./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
- Build: ./build.sh --be
- Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh, git diff --check
- Manual test: local four-BE cluster, 800/800 Variant local-shuffle anti-join queries passed, 800/800 non-Variant control queries passed
- Behavior changed: Yes. Variant cast and serialization no longer mutate shared source columns.
- Does this need documentation: No
7df8a51 to
5bad3f7
Compare
|
Addressed the latest blocking review and the Summary:
Verified locally:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review summary:
No additional blocking issues found in this pass. I reviewed the changed Variant serialization, cast-from-Variant prefix handling, Subcolumn range-copy default materialization, and the added BE unit tests. The existing inline thread about zero-row helper access is already known review context; the current head contains an early zero-row return, so I did not duplicate it.
Critical checkpoint conclusions:
- Goal and tests: The change targets in-place finalization of shared ColumnVariant inputs in cast/serialization paths and pending default suffix handling. The code uses private finalized copies and adds focused BE UT coverage for serialization, cast prefixes, nullable inputs, and default suffix copying.
- Scope and clarity: The modification is focused on the affected Variant paths and tests.
- Concurrency: The relevant concurrency risk is shared Block/ColumnVariant observation across local exchange consumers. The reviewed paths avoid mutating shared Variant inputs by finalizing private copies. No new locks or lock-order concerns were introduced.
- Lifecycle/static initialization: No new special lifecycle or cross-translation-unit static initialization risk observed.
- Configuration: No new configuration items.
- Compatibility: No serialized format/schema change observed; serialization still writes the same fields.
- Parallel code paths: Cast-from-Variant string/JSONB and DataTypeVariant byte-size/serialize paths were both updated.
- Conditional checks: Prefix and zero-row handling are explicit and covered by tests.
- Test coverage/results: Added BE UTs cover the main changed paths. I attempted a targeted run with
./run-be-ut.sh --run --filter=..., but this runner failed during gensrc setup before tests because/home/runner/work/doris/doris/thirdparty/installed/bin/protocis missing. - Observability: No new observability appears necessary for this internal correctness fix.
- Transaction/persistence/data-write correctness: Not applicable; this PR does not alter transaction, persistence, visible-version, or write commit logic.
- FE/BE protocol variables: Not applicable.
- Performance: The fix intentionally trades extra private copies for correctness in serialization/cast paths. I did not find an avoidable correctness-impacting performance issue.
User focus points: No additional user-provided review focus was specified.
TPC-H: Total hot run time: 29239 ms |
TPC-DS: Total hot run time: 168792 ms |
What problem does this PR solve?
Issue Number: None
Problem Summary:
This PR fixes Variant cast and serialization paths that finalized shared
ColumnVariantinputs in place. In local-shuffle anti-join/profile query execution, the same Variant column can be observed by multiple consumers, so in-place finalization mutates shared block state and can produce inconsistent query results.End-to-end reproduction and validation:
Root cause:
Variant cast and serialization code finalized
ColumnVariantinputs in place. That changed shared column state visible to later consumers. A separate range-copy path could also leave lazy default rows pending for a copied Variant subcolumn suffix.The cast path also needs to handle empty prefixes and root-only unfinalized Variant columns. A zero-row prefix can otherwise create a zero-row
ColumnVariantand then call helpers that assumenum_rows > 0. The reportedFunctionVariantCast.CastFromVariantcrash was caused by comparing the requested input rows againstColumnVariant::size()for a root-only unfinalized Variant test column whose root contains rows whileColumnVariant::size()is 0.Fix:
ColumnVariant::size()as the semantic row-count guard for root-only unfinalized Variant inputs.Subcolumn::insert_range_from.Release note
None
Check List (For Author)
./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.shgit diff --checkgit diff --check upstream/master...HEADenv -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY ./build.sh --be