[fix](be) Avoid mutating shared Variant columns #64094

Closed
eldenmoon wants to merge 1 commit into apache:master from eldenmoon:branch-fix-nullable-hash-no-mutation

@eldenmoon eldenmoon commented Jun 3, 2026

Member

What problem does this PR solve?

Issue Number: None

Problem Summary:
This PR fixes Variant cast and serialization paths that finalized shared ColumnVariant inputs in place. In local-shuffle anti-join/profile query execution, the same Variant column can be observed by multiple consumers, so in-place finalization mutates shared block state and can produce inconsistent query results.

End-to-end reproduction and validation:

  • Reproduced the failure with a local multi-BE setup using the affected Variant anti-join/local-shuffle query shape.
  • Verified a non-Variant control using the same local-shuffle/anti-join shape does not reproduce the issue.
  • After the fix, the Variant repro passed 800/800 runs on a local 4-BE cluster and returned the expected result with profile enabled.
  • The non-Variant control also passed 800/800 runs, confirming this fix is Variant-specific rather than a generic local-shuffle change.

Root cause:
Variant cast and serialization code finalized ColumnVariant inputs in place. That changed shared column state visible to later consumers. A separate range-copy path could also leave lazy default rows pending for a copied Variant subcolumn suffix.

The cast path also needs to handle empty prefixes and root-only unfinalized Variant columns. A zero-row prefix can otherwise create a zero-row ColumnVariant and then call helpers that assume num_rows > 0. The reported FunctionVariantCast.CastFromVariant crash was caused by comparing the requested input rows against ColumnVariant::size() for a root-only unfinalized Variant test column whose root contains rows while ColumnVariant::size() is 0.

Fix:

  • Use private finalized copies in Variant serialization paths.
  • Avoid mutating Variant source columns during cast from Variant.
  • Handle prefix casts for nullable and non-nullable Variant inputs, including zero-row prefixes.
  • Avoid treating ColumnVariant::size() as the semantic row-count guard for root-only unfinalized Variant inputs.
  • Materialize pending default suffix rows in Subcolumn::insert_range_from.
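The first two fixes can be illustrated with a minimal sketch. `ToyVariantColumn`, its members, and the free function `serialize` are hypothetical stand-ins invented for this illustration; they are not the real `ColumnVariant` API, and the actual change may be structured differently.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-in for a lazily finalized column; NOT the real ColumnVariant.
struct ToyVariantColumn {
    std::vector<int> parts;   // unmerged pieces appended by writers
    std::vector<int> merged;  // finalized representation
    bool finalized = true;

    void append(int v) {
        parts.push_back(v);
        finalized = false;
    }

    // In-place finalization: mutates state that other readers may share.
    void finalize() {
        merged.insert(merged.end(), parts.begin(), parts.end());
        parts.clear();
        finalized = true;
    }

    // Hypothetical helper: return a finalized deep copy and leave *this untouched.
    std::unique_ptr<ToyVariantColumn> clone_finalized() const {
        auto copy = std::make_unique<ToyVariantColumn>(*this);
        copy->finalize();
        return copy;
    }
};

// Serialization needs a finalized layout but must not touch the shared source.
inline std::vector<int> serialize(const ToyVariantColumn& shared) {
    if (shared.finalized) {
        return shared.merged;  // fast path: no copy needed
    }
    return shared.clone_finalized()->merged;  // the private copy absorbs the mutation
}
```

In this sketch the extra copy is paid only when the source is unfinalized, which is the tradeoff the PR accepts for shared-block safety.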

Release note

None

Check List (For Author)

  • Test: Regression test / Unit Test / Manual test
    • ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    • ./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    • PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh
    • git diff --check
    • git diff --check upstream/master...HEAD
    • env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY ./build.sh --be
    • Local 4-BE E2E Variant query: 800/800 passed, expected result 0.
    • Local 4-BE E2E non-Variant control: 800/800 passed, expected result 0.
    • Local 4-BE profile-enabled Variant query returned expected result 0.
    • Independent Codex subagent review: no blocking findings.
  • Behavior changed: No
  • Does this need documentation: No

Copilot AI review requested due to automatic review settings June 3, 2026 20:46
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon

Member Author

run buildall

Copilot AI left a comment

Contributor

Pull request overview

Fixes a correctness issue in BE local-shuffle hashing for nullable columns where a logically-const CRC32C hash computation could mutate the nullable column’s nested data, potentially leaking the mutation to other consumers when the column instance is shared.

Changes:

  • Update ColumnNullable::update_crc32c_batch() to avoid in-place mutation by hashing a temporary cloned nested column only when nulls are present (preserving the no-null fast path).
  • Add a BE unit test to assert update_crc32c_batch() does not mutate the original nested column values.
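A rough sketch of the clone-and-normalize idea described in this review follows. `ToyNullableColumn`, `mix`, and `update_hash_batch` are hypothetical names, and the mixing function is a simple stand-in rather than CRC32C; the real `ColumnNullable::update_crc32c_batch()` may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy nullable column: nested values plus a null map (1 = NULL).
// Not the real ColumnNullable.
struct ToyNullableColumn {
    std::vector<int64_t> nested;
    std::vector<uint8_t> null_map;
};

// Stand-in mixing function; the code under review uses CRC32C instead.
inline uint32_t mix(uint32_t seed, int64_t v) {
    uint64_t x = (static_cast<uint64_t>(seed) ^ static_cast<uint64_t>(v)) *
                 0x9E3779B97F4A7C15ULL;
    return static_cast<uint32_t>(x >> 32);
}

inline void update_hash_batch(const ToyNullableColumn& col, std::vector<uint32_t>& hashes) {
    assert(hashes.size() == col.nested.size());
    assert(col.null_map.size() == col.nested.size());

    bool has_null = false;
    for (uint8_t flag : col.null_map) {
        if (flag != 0) has_null = true;
    }

    if (!has_null) {  // fast path: hash the nested data directly, no copy
        for (size_t i = 0; i < col.nested.size(); ++i) {
            hashes[i] = mix(hashes[i], col.nested[i]);
        }
        return;
    }

    // NULL rows may hold arbitrary values. Normalize them on a private copy so
    // that NULL rows hash identically without rewriting the shared nested data.
    std::vector<int64_t> normalized = col.nested;
    for (size_t i = 0; i < normalized.size(); ++i) {
        if (col.null_map[i] != 0) normalized[i] = 0;
    }
    for (size_t i = 0; i < normalized.size(); ++i) {
        hashes[i] = mix(hashes[i], normalized[i]);
    }
}
```

Two columns that differ only in the garbage stored under NULL rows produce the same hashes here, and neither column's nested data changes.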

Checkpoint Conclusions (per core review guide)

  • Goal & correctness: The change directly prevents mutation leakage by ensuring normalization happens on a temporary column; the new UT reproduces and guards the behavior.
  • Concurrency / sharing safety: The updated code no longer mutates _nested_column inside a const method, removing the primary aliasing/thread-safety hazard described in the PR.
  • Performance considerations: No-null path remains fast. Null-present path now incurs a clone of the nested column; this is a deliberate tradeoff for correctness and isolation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| be/src/core/column/column_nullable.cpp | Stop mutating the stored nested column during CRC32C hashing; clone-and-normalize only when nulls are present. |
| be/test/core/column/column_nullable_test.cpp | Add UT asserting CRC32C batch hashing does not mutate the nullable nested column values. |

@hello-stephen

Contributor
TPC-H: Total hot run time: 29019 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 597f79e343f65c08699d57d83874c6f33c2f140c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17705	4108	4024	4024
q2	q3	10835	1420	863	863
q4	4684	477	351	351
q5	7567	876	582	582
q6	185	172	138	138
q7	768	828	648	648
q8	9352	1583	1600	1583
q9	5747	4503	4463	4463
q10	6615	1789	1541	1541
q11	428	274	256	256
q12	637	427	297	297
q13	18139	3331	2795	2795
q14	262	256	245	245
q15	q16	789	778	713	713
q17	944	938	878	878
q18	6803	5685	5582	5582
q19	1330	1257	1061	1061
q20	520	392	265	265
q21	5776	2563	2430	2430
q22	442	354	304	304
Total cold run time: 99528 ms
Total hot run time: 29019 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4308	4209	4338	4209
q2	q3	4499	4913	4303	4303
q4	2116	2185	1372	1372
q5	4449	4275	4248	4248
q6	232	170	129	129
q7	1744	1629	1468	1468
q8	2998	2423	2191	2191
q9	8125	8120	8206	8120
q10	4787	4758	4349	4349
q11	578	419	368	368
q12	754	753	541	541
q13	3273	3612	2961	2961
q14	303	313	270	270
q15	q16	731	737	625	625
q17	1358	1308	1312	1308
q18	8074	7181	7259	7181
q19	1166	1138	1087	1087
q20	2225	2214	1966	1966
q21	5204	4527	4400	4400
q22	524	484	395	395
Total cold run time: 57448 ms
Total hot run time: 51491 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 168466 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 597f79e343f65c08699d57d83874c6f33c2f140c, data reload: false

query5	4327	642	478	478
query6	450	213	174	174
query7	4879	555	298	298
query8	363	223	212	212
query9	8759	3989	4029	3989
query10	443	331	286	286
query11	5878	2335	2127	2127
query12	153	104	100	100
query13	1242	629	419	419
query14	6418	5341	4993	4993
query14_1	4340	4349	4328	4328
query15	205	192	177	177
query16	989	454	413	413
query17	921	691	569	569
query18	2415	455	340	340
query19	193	181	138	138
query20	107	115	106	106
query21	218	151	120	120
query22	13653	13596	13351	13351
query23	17346	16493	16187	16187
query23_1	16162	16284	16316	16284
query24	7542	1758	1317	1317
query24_1	1314	1308	1308	1308
query25	595	487	407	407
query26	1298	334	173	173
query27	2690	550	349	349
query28	4498	2008	2038	2008
query29	1099	627	512	512
query30	313	225	198	198
query31	1124	1066	947	947
query32	125	63	61	61
query33	532	327	260	260
query34	1188	1133	654	654
query35	754	793	692	692
query36	1379	1431	1267	1267
query37	156	105	90	90
query38	3214	3120	3037	3037
query39	938	921	898	898
query39_1	883	892	879	879
query40	224	125	105	105
query41	70	68	68	68
query42	97	99	97	97
query43	319	319	279	279
query44	
query45	200	195	194	194
query46	1111	1237	756	756
query47	2363	2351	2257	2257
query48	418	405	324	324
query49	664	491	386	386
query50	1010	353	261	261
query51	4363	4288	4229	4229
query52	89	95	83	83
query53	260	271	196	196
query54	283	239	225	225
query55	81	77	74	74
query56	256	240	223	223
query57	1423	1428	1351	1351
query58	268	220	217	217
query59	1597	1681	1443	1443
query60	300	258	235	235
query61	218	162	155	155
query62	694	656	582	582
query63	228	183	182	182
query64	2581	765	623	623
query65	
query66	1836	454	337	337
query67	29541	29489	29458	29458
query68	
query69	418	303	266	266
query70	946	933	937	933
query71	300	219	214	214
query72	3027	2736	2382	2382
query73	857	733	453	453
query74	5115	4917	4797	4797
query75	2653	2547	2217	2217
query76	2350	1163	757	757
query77	360	375	279	279
query78	12402	12424	11839	11839
query79	1481	1038	719	719
query80	863	461	400	400
query81	491	278	238	238
query82	582	159	126	126
query83	366	277	244	244
query84	304	145	111	111
query85	926	527	459	459
query86	412	289	272	272
query87	3362	3293	3164	3164
query88	3598	2740	2732	2732
query89	475	382	331	331
query90	1841	177	179	177
query91	178	165	135	135
query92	63	62	59	59
query93	1465	1420	906	906
query94	626	349	284	284
query95	672	486	344	344
query96	1076	798	394	394
query97	2703	2720	2574	2574
query98	209	204	208	204
query99	1146	1154	1037	1037
Total cold run time: 251287 ms
Total hot run time: 168466 ms

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 100.00% (5/5) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.90% (21026/39006) |
| Line Coverage | 37.57% (199803/531792) |
| Region Coverage | 33.67% (156771/465638) |
| Branch Coverage | 34.63% (68576/198005) |

@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from 597f79e to 7f17d77 Compare June 3, 2026 22:05
@eldenmoon

Member Author

run buildall

@eldenmoon

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

Review result: no blocking issues found.

Critical checkpoint conclusions:

  • Goal and tests: The PR addresses the stated issue by removing the in-place mutation from nullable CRC32C hashing while preserving default-value hashing for fixed-width nested columns. Added BE unit tests cover int64 and decimal nullable columns and verify nested data is not mutated.
  • Scope and clarity: The change is small and focused on the CRC32C nullable hash path, with direct vector/decimal hot-path implementations and an isolated fallback.
  • Concurrency and lifecycle: The change reduces shared-column mutation risk. No new shared mutable state, locks, threads, static initialization, or lifecycle ownership concerns were introduced.
  • Configuration and compatibility: No new config, persistence, storage-format, protocol, or rolling-upgrade compatibility concern was introduced.
  • Parallel paths: The direct CRC32C partitioning callers continue to use update_crc32c_batch. Non-CRC hash paths are unchanged.
  • Conditional checks: The new has_null fast path and support_replace_column_null_data branch are consistent with existing nullable-column patterns.
  • Test coverage: Focused unit tests were added. I attempted ./run-be-ut.sh --run --filter=ColumnNullableTest.UpdateCrc32cBatch*, but this runner could not complete BE UT setup because thirdparty/installed/bin/protoc was missing during gensrc generation. This occurred before compiling the touched code.
  • Test result files: No regression output files are involved.
  • Observability: No new runtime observability appears necessary for this localized hash correctness fix.
  • Transactions and persistence: Not applicable.
  • Data writes and modification: The key correctness improvement is avoiding mutation of nested nullable column data during logically const hashing.
  • FE/BE variable passing: Not applicable.
  • Performance: Vector and decimal hot paths avoid cloning and retain direct hashing. The no-null path avoids null-map processing. I did not identify a concrete performance regression requiring changes.
  • User focus: No additional user-provided review focus was present.

I also reviewed existing inline review context and found no prior inline comments to account for.

@hello-stephen

Contributor
TPC-H: Total hot run time: 29426 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7f17d77d1fc4d7ab61a662a98c48cb074c758b20, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17859	4052	4047	4047
q2	q3	10762	1416	815	815
q4	4730	474	337	337
q5	7924	890	581	581
q6	211	173	138	138
q7	836	821	651	651
q8	10439	1598	1688	1598
q9	7082	4519	4540	4519
q10	6804	1838	1551	1551
q11	433	269	252	252
q12	641	429	288	288
q13	18141	3340	2757	2757
q14	269	256	241	241
q15	q16	819	786	707	707
q17	1027	963	892	892
q18	7002	5782	5610	5610
q19	1252	1343	1154	1154
q20	528	446	368	368
q21	6030	2723	2610	2610
q22	453	371	310	310
Total cold run time: 103242 ms
Total hot run time: 29426 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4761	4650	4905	4650
q2	q3	4853	5229	4658	4658
q4	2121	2175	1396	1396
q5	4911	4813	4688	4688
q6	233	178	127	127
q7	1862	1730	1586	1586
q8	2389	2061	1911	1911
q9	7471	7395	7335	7335
q10	4716	4661	4174	4174
q11	540	386	362	362
q12	729	752	520	520
q13	3031	3382	2784	2784
q14	277	285	252	252
q15	q16	683	699	603	603
q17	1271	1250	1240	1240
q18	7246	6943	6829	6829
q19	1146	1103	1141	1103
q20	2219	2223	1950	1950
q21	5250	4538	4454	4454
q22	523	469	396	396
Total cold run time: 56232 ms
Total hot run time: 51018 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 169348 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7f17d77d1fc4d7ab61a662a98c48cb074c758b20, data reload: false

query5	4326	622	470	470
query6	443	197	179	179
query7	4865	557	298	298
query8	376	230	216	216
query9	8777	4163	4147	4147
query10	474	315	260	260
query11	5926	2342	2191	2191
query12	157	101	98	98
query13	1278	609	433	433
query14	6381	5375	5046	5046
query14_1	4360	4368	4391	4368
query15	203	194	175	175
query16	1036	463	430	430
query17	1102	733	603	603
query18	2506	483	376	376
query19	223	194	150	150
query20	112	108	103	103
query21	216	135	120	120
query22	13591	13515	13390	13390
query23	17324	16512	16146	16146
query23_1	16333	16346	16384	16346
query24	7693	1809	1283	1283
query24_1	1295	1284	1316	1284
query25	583	483	411	411
query26	1351	317	174	174
query27	2667	585	338	338
query28	4494	2030	1998	1998
query29	1128	625	503	503
query30	318	246	204	204
query31	1143	1081	955	955
query32	107	63	62	62
query33	520	328	265	265
query34	1162	1144	670	670
query35	756	780	683	683
query36	1374	1355	1208	1208
query37	165	110	95	95
query38	3216	3157	3026	3026
query39	923	932	914	914
query39_1	885	892	864	864
query40	223	126	105	105
query41	73	67	71	67
query42	100	96	97	96
query43	317	324	275	275
query44	
query45	200	184	184	184
query46	1086	1187	768	768
query47	2330	2349	2257	2257
query48	414	416	282	282
query49	670	491	369	369
query50	983	351	255	255
query51	4328	4339	4237	4237
query52	92	90	79	79
query53	249	283	194	194
query54	288	248	230	230
query55	87	81	74	74
query56	253	270	241	241
query57	1398	1412	1339	1339
query58	295	212	212	212
query59	1553	1644	1450	1450
query60	279	258	229	229
query61	158	151	157	151
query62	692	666	594	594
query63	231	186	185	185
query64	2559	784	631	631
query65	
query66	1782	469	338	338
query67	29540	29516	29374	29374
query68	
query69	428	320	268	268
query70	974	999	929	929
query71	324	213	214	213
query72	3055	2712	2353	2353
query73	867	779	442	442
query74	5112	4978	4753	4753
query75	2643	2571	2258	2258
query76	2330	1148	786	786
query77	389	380	280	280
query78	12457	12311	11909	11909
query79	1446	1012	755	755
query80	723	467	392	392
query81	475	286	241	241
query82	590	151	120	120
query83	350	279	251	251
query84	260	147	109	109
query85	921	529	433	433
query86	407	296	281	281
query87	3360	3383	3195	3195
query88	3669	2791	2766	2766
query89	434	385	325	325
query90	1809	166	185	166
query91	177	168	137	137
query92	63	60	52	52
query93	1472	1420	859	859
query94	627	351	303	303
query95	716	459	337	337
query96	1104	773	347	347
query97	2686	2690	2551	2551
query98	205	203	208	203
query99	1145	1165	1039	1039
Total cold run time: 251326 ms
Total hot run time: 169348 ms

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 83.33% (25/30) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.91% (21030/39010) |
| Line Coverage | 37.58% (199831/531814) |
| Region Coverage | 33.69% (156901/465686) |
| Branch Coverage | 34.64% (68600/198017) |

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.33% (25/30) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.91% (27487/38226) |
| Line Coverage | 55.43% (294061/530542) |
| Region Coverage | 52.02% (244616/470196) |
| Branch Coverage | 53.29% (105948/198810) |

@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from 7f17d77 to 4e2cf50 Compare June 4, 2026 03:22
@eldenmoon eldenmoon changed the title from "[fix](be) Avoid mutating nullable columns while hashing" to "[fix](be) Avoid mutating shared Variant columns" Jun 4, 2026
@eldenmoon

Member Author

run buildall

@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from 4e2cf50 to 691c177 Compare June 4, 2026 03:39
@eldenmoon

Member Author

run buildall

@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from 691c177 to df1a913 Compare June 4, 2026 04:11
@eldenmoon

Member Author

run buildall

@eldenmoon

Member Author

/review

@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from df1a913 to be03db5 Compare June 4, 2026 04:33
@eldenmoon

Member Author

run buildall

@eldenmoon

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

Reviewed the actual GitHub PR diff for PR #64094, which contains the Variant mutation fix and its BE unit tests. I did not find blocking issues.

Critical checkpoint conclusions:

  • Goal and proof: The change prevents cast/serialization from finalizing shared ColumnVariant inputs in place and fixes range copy over pending default suffixes. Added BE tests directly cover source non-mutation and suffix materialization.
  • Scope: The actual PR diff is focused on Variant code and related tests.
  • Concurrency: The relevant concurrency concern is shared blocks from local exchange. The changed cast and serialization paths now finalize private clones before operations that require finalized columns, so they avoid mutating shared input columns.
  • Lifecycle/static initialization: No new static/global lifecycle concerns found.
  • Configuration: No config items added.
  • Compatibility: Serialized Variant layout is unchanged; byte-size and serialize paths were updated consistently to use finalized private copies.
  • Parallel code paths: get_uncompressed_serialized_bytes, serialize, and string/JSONB cast paths were covered. I did not find another Variant finalization path in this diff that needs the same adjustment.
  • Conditional checks: The pending-default suffix insertion is bounded by the requested range and the added tests cover both partial suffix and suffix-only copies.
  • Test coverage: Unit tests were added for Subcolumn::insert_range_from, block/datatype serialization source non-mutation, clone-finalized deep copy, and Variant cast source non-mutation. I did not run tests in this review runner.
  • Test results: No .out result files are involved.
  • Observability: No new observability appears necessary for this internal correctness fix.
  • Transactions/persistence/data writes: Not involved.
  • FE-BE variables/protocol: Not involved.
  • Performance: The extra clone happens only for unfinalized Variant inputs on paths that previously mutated the source; this is a reasonable correctness tradeoff for shared-block safety.
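The pending-default suffix handling mentioned in the checkpoints above can be sketched roughly as follows. `ToySubcolumn` is a hypothetical simplification with a single materialized vector and a count of lazy trailing defaults; the real `Subcolumn::insert_range_from` operates on a richer structure, so treat this only as an illustration of the bounded-range idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a subcolumn with lazily appended default rows.
struct ToySubcolumn {
    std::vector<int> values;      // materialized rows
    size_t pending_defaults = 0;  // lazy default rows logically placed after `values`

    size_t size() const { return values.size() + pending_defaults; }

    void insert_default() { ++pending_defaults; }

    // Write out this column's own lazy defaults (0 is the default value here).
    void materialize_pending() {
        values.insert(values.end(), pending_defaults, 0);
        pending_defaults = 0;
    }

    // Copy rows [start, start + length) from src, including rows that live only
    // in src's lazy suffix. src is never modified.
    void insert_range_from(const ToySubcolumn& src, size_t start, size_t length) {
        assert(start + length <= src.size());
        materialize_pending();  // keep destination row order before appending

        const size_t end = start + length;
        const size_t mat_begin = std::min(start, src.values.size());
        const size_t mat_end = std::min(end, src.values.size());
        values.insert(values.end(),
                      src.values.begin() + static_cast<std::ptrdiff_t>(mat_begin),
                      src.values.begin() + static_cast<std::ptrdiff_t>(mat_end));

        // Rows past src.values fall in the lazy suffix: emit them as real defaults,
        // bounded by the requested range.
        const size_t suffix_rows = length - (mat_end - mat_begin);
        values.insert(values.end(), suffix_rows, 0);
    }
};
```

The sketch covers both cases named in the review: a range that straddles the materialized rows and the suffix, and a range that lies entirely in the suffix.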

User focus points: No additional user-provided review focus was specified.

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 86.89% (53/61) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.93% (21053/39041) |
| Line Coverage | 37.59% (200165/532474) |
| Region Coverage | 33.68% (157002/466185) |
| Branch Coverage | 34.63% (68673/198305) |

@github-actions

github-actions Bot commented Jun 4, 2026

Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/26930164374

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from be03db5 to 7df8a51 Compare June 4, 2026 06:34
@eldenmoon

Member Author

run buildall

/review

@eldenmoon

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found one blocking correctness issue in the new Variant cast copy-on-write path. The change is focused and the tests cover several non-mutating/finalized cases, but the prefix execution edge below can crash or read out of bounds.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to avoid mutating shared Variant columns during serialization/casts, and it adds BE unit coverage for common serialization and cast paths. The zero-row prefix cast edge is not covered.
  • Scope: The code changes are mostly focused, but the cast path now creates zero-row finalized Variant columns before calling helpers that assume non-empty input.
  • Concurrency/COW: The main local-exchange shared-block hazard is addressed by private finalized copies, but the new copy/resize ordering introduces the issue below.
  • Lifecycle/static/config: No new static lifecycle or configuration concerns found.
  • Compatibility: No serialized layout/protocol compatibility changes found.
  • Parallel paths: Serialization and cast paths were both considered; the issue is specific to cast prefix handling.
  • Tests: Added tests are useful but miss input_rows_count == 0 for non-finalized Variant casts.
  • Observability/transactions/writes: Not applicable.
  • Performance: The private clone-finalize approach can add work, but I did not find a blocking performance regression beyond the correctness issue.

User focus: No additional user-provided review focus was specified.

Comment thread be/src/exprs/function/cast/cast_to_variant.h
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Local exchange and join execution can share input blocks across downstream tasks. Variant cast and block serialization finalized ColumnVariant in place, and Subcolumn::insert_range_from could leave the lazy-default suffix unmaterialized when copying ranges. For local-shuffle anti-join queries that evaluate Variant path casts, one task could mutate a shared Variant column while another task is still reading it, leading to unstable results or range-copy failures. This change finalizes private deep copies for Variant cast/serialization paths, trims serialized Variant cast inputs to the requested input row prefix, and materializes pending defaults during range copy.

The cast path must also handle empty prefixes and legacy root-only unfinalized Variant columns. An empty prefix can otherwise create a zero-row ColumnVariant and then call helpers that assume num_rows > 0. Root-only unfinalized Variant test columns can also have a semantic input row count greater than ColumnVariant::size(), so checking the requested rows against ColumnVariant::size() can crash even though the root column contains the rows being cast.

The fix was reproduced with Variant red tests: the old code finalized source Variant columns during cast/serialization, failed prefix Variant-to-JSONB casts on private finalized copies, failed already-finalized prefix Variant-to-JSONB casts, crashed CastFromVariant on a root-only unfinalized Variant column, and failed to copy pending defaults. The same tests pass after the change. A local four-BE cluster also verified the affected local-shuffle anti-join query with Variant expressions and a non-Variant control query on the same plan shape.

### Release note

Fix an issue where local-shuffle queries using VARIANT expressions could return unstable results or fail.

### Check List (For Author)

- Test:
    - BE Unit Test red/green: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant' reproduced the root-only unfinalized Variant crash before the CastFromVariant guard fix and passed after it
    - BE Unit Test: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    - BE Unit Test: ./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    - Build: ./build.sh --be
    - Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh, git diff --check
    - Manual test: local four-BE cluster, 800/800 Variant local-shuffle anti-join queries passed, 800/800 non-Variant control queries passed
- Behavior changed: Yes. Variant cast and serialization no longer mutate shared source columns.
- Does this need documentation: No
@eldenmoon eldenmoon force-pushed the branch-fix-nullable-hash-no-mutation branch from 7df8a51 to 5bad3f7 Compare June 4, 2026 08:01
@eldenmoon

Member Author

Addressed the latest blocking review and the FunctionVariantCast.CastFromVariant crash in 5bad3f7c38512.

Summary:

  • For input_rows_count == 0, cast_from_variant_impl now returns an empty target column before calling ColumnVariant helpers that assume non-empty rows.
  • Removed the invalid row-count guard that compared the requested semantic row count with ColumnVariant::size(). The reported crash used a root-only unfinalized Variant test column whose root has 3 rows while ColumnVariant::size() is 0.
  • Added FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn covering string and JSONB zero-row prefix casts on non-empty unfinalized Variant input.
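A minimal sketch of the two guards summarized above, under stated assumptions: `ToyVariant` and `cast_prefix_to_string` are hypothetical, and bounding the prefix by the root's row count is a choice made for this toy, not a claim about what `cast_from_variant_impl` does after removing the old guard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in: a "root-only unfinalized" column whose size() reports 0 even
// though its root holds rows. This mirrors the situation described above and
// is not the real ColumnVariant.
struct ToyVariant {
    std::vector<int> root;     // rows stored in the root
    size_t reported_rows = 0;  // what size() returns

    size_t size() const { return reported_rows; }
};

// Cast the first `input_rows_count` rows to strings without touching `src`.
inline std::vector<std::string> cast_prefix_to_string(const ToyVariant& src,
                                                      size_t input_rows_count) {
    std::vector<std::string> out;
    if (input_rows_count == 0) {
        return out;  // empty prefix: skip helpers that assume at least one row
    }

    // Toy assumption: bound the prefix by the rows present in the root rather
    // than by size(), which can be 0 for this kind of column.
    assert(input_rows_count <= src.root.size());
    out.reserve(input_rows_count);
    for (size_t i = 0; i < input_rows_count; ++i) {
        out.push_back(std::to_string(src.root[i]));
    }
    return out;
}
```

With a three-row root and `size() == 0`, a guard of the form `input_rows_count <= src.size()` would reject a valid two-row prefix, which is the failure mode the removed check produced.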

Verified locally:

  • ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
  • ./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
  • PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh
  • git diff --check

@eldenmoon

Member Author

run buildall

@eldenmoon

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

Review summary:

No additional blocking issues found in this pass. I reviewed the changed Variant serialization, cast-from-Variant prefix handling, Subcolumn range-copy default materialization, and the added BE unit tests. The zero-row helper access issue is already tracked in an existing inline thread, and the current head contains an early zero-row return, so I did not raise it again.

Critical checkpoint conclusions:

  • Goal and tests: The change targets in-place finalization of shared ColumnVariant inputs in cast/serialization paths and pending default suffix handling. The code uses private finalized copies and adds focused BE UT coverage for serialization, cast prefixes, nullable inputs, and default suffix copying.
  • Scope and clarity: The modification is focused on the affected Variant paths and tests.
  • Concurrency: The relevant concurrency risk is shared Block/ColumnVariant observation across local exchange consumers. The reviewed paths avoid mutating shared Variant inputs by finalizing private copies. No new locks or lock-order concerns were introduced.
  • Lifecycle/static initialization: No new special lifecycle or cross-translation-unit static initialization risk observed.
  • Configuration: No new configuration items.
  • Compatibility: No serialized format/schema change observed; serialization still writes the same fields.
  • Parallel code paths: Cast-from-Variant string/JSONB and DataTypeVariant byte-size/serialize paths were both updated.
  • Conditional checks: Prefix and zero-row handling are explicit and covered by tests.
  • Test coverage/results: Added BE UTs cover the main changed paths. I attempted a targeted run with ./run-be-ut.sh --run --filter=..., but it failed during gensrc setup, before any test ran, because /home/runner/work/doris/doris/thirdparty/installed/bin/protoc is missing on this runner.
  • Observability: No new observability appears necessary for this internal correctness fix.
  • Transaction/persistence/data-write correctness: Not applicable; this PR does not alter transaction, persistence, visible-version, or write commit logic.
  • FE/BE protocol variables: Not applicable.
  • Performance: The fix intentionally trades extra private copies for correctness in serialization/cast paths. I did not find an avoidable correctness-impacting performance issue.
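
The concurrency and performance points above come down to a copy-then-finalize pattern. A minimal sketch, assuming a hypothetical `ToyColumn` type and `finalized_view` helper (neither is a Doris API), and assuming that an already-finalized column can be shared as is:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical column whose unfinalized state is several parts that
// finalize() merges into one. Not the real ColumnVariant.
struct ToyColumn {
    std::vector<std::vector<int>> parts;

    bool finalized() const { return parts.size() <= 1; }

    void finalize() {
        std::vector<int> merged;
        for (const auto& p : parts) {
            merged.insert(merged.end(), p.begin(), p.end());
        }
        parts.clear();
        parts.push_back(std::move(merged));
    }
};

using ColumnPtr = std::shared_ptr<const ToyColumn>;

// Returns a finalized column without mutating the shared input.
ColumnPtr finalized_view(const ColumnPtr& shared) {
    assert(shared != nullptr);
    if (shared->finalized()) {
        return shared;  // assumption: nothing to change, so no copy is needed
    }
    auto copy = std::make_shared<ToyColumn>(*shared);  // the extra private copy
    copy->finalize();
    return copy;
}
```

Every consumer that holds the original `ColumnPtr` keeps seeing the unfinalized parts, so no lock is needed; the cost is one copy per unfinalized input.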

User focus points: No additional user-provided review focus was specified.

@hello-stephen

Contributor
TPC-H: Total hot run time: 29239 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5bad3f7c385126fdabdac18cc9d522aff4713480, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17668	3985	3968	3968
q2	
q3	10751	1361	836	836
q4	4683	480	344	344
q5	7624	878	593	593
q6	183	172	137	137
q7	780	857	642	642
q8	9376	1647	1720	1647
q9	5722	4456	4515	4456
q10	6795	1849	1555	1555
q11	435	270	251	251
q12	631	422	288	288
q13	18125	3274	2798	2798
q14	265	260	244	244
q15	
q16	818	774	713	713
q17	987	857	988	857
q18	6970	5831	5557	5557
q19	1332	1301	1038	1038
q20	521	402	269	269
q21	6270	2876	2734	2734
q22	469	369	312	312
Total cold run time: 100405 ms
Total hot run time: 29239 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4990	4734	4851	4734
q2	
q3	4795	5317	4744	4744
q4	2110	2206	1394	1394
q5	4776	4994	4717	4717
q6	228	176	129	129
q7	1864	1828	1550	1550
q8	2402	2100	2078	2078
q9	7906	7741	7396	7396
q10	4734	4686	4212	4212
q11	529	380	350	350
q12	734	741	529	529
q13	3047	3383	2816	2816
q14	282	278	246	246
q15	
q16	675	697	605	605
q17	1276	1245	1231	1231
q18	7405	6913	6859	6859
q19	1131	1074	1096	1074
q20	2229	2219	1961	1961
q21	5258	4546	4408	4408
q22	519	472	421	421
Total cold run time: 56890 ms
Total hot run time: 51454 ms
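
The bot output does not label its columns. The arithmetic is consistent with this reading: the first number in each row is the cold run, the next two are hot runs, and the last is the smaller of the two hot runs. Summing the first column of each round reproduces the reported cold totals (100405 ms and 56890 ms), and summing the last column reproduces the hot totals (29239 ms and 51454 ms). A small sketch of that aggregation, with hypothetical names (`QueryTiming`, `best_hot`, `total_cold`, `total_hot`):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One result row; the column meanings are inferred from the totals, not documented.
struct QueryTiming {
    long cold_ms;
    long hot1_ms;
    long hot2_ms;
};

// The per-query hot time is the faster of the two hot runs.
long best_hot(const QueryTiming& t) {
    assert(t.hot1_ms >= 0 && t.hot2_ms >= 0);  // timings are non-negative milliseconds
    return std::min(t.hot1_ms, t.hot2_ms);
}

long total_cold(const std::vector<QueryTiming>& rows) {
    long sum = 0;
    for (const auto& r : rows) sum += r.cold_ms;
    return sum;
}

long total_hot(const std::vector<QueryTiming>& rows) {
    long sum = 0;
    for (const auto& r : rows) sum += best_hot(r);
    return sum;
}
```

For example, feeding only the Round 1 rows for q1, q6 and q8 gives a cold total of 27227 ms and a hot total of 5752 ms.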

@hello-stephen

Contributor
TPC-DS: Total hot run time: 168792 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5bad3f7c385126fdabdac18cc9d522aff4713480, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4312	634	477	477
query6	438	202	181	181
query7	4824	591	304	304
query8	388	220	209	209
query9	8772	3992	3965	3965
query10	466	308	260	260
query11	5940	2369	2191	2191
query12	165	104	100	100
query13	1281	612	434	434
query14	6453	5381	5100	5100
query14_1	4307	4343	4347	4343
query15	207	199	178	178
query16	1011	445	415	415
query17	922	680	552	552
query18	2458	467	345	345
query19	200	182	143	143
query20	106	106	107	106
query21	215	137	114	114
query22	13581	13638	13330	13330
query23	17320	16595	16144	16144
query23_1	16369	16158	16283	16158
query24	7484	1772	1330	1330
query24_1	1301	1293	1291	1291
query25	533	449	387	387
query26	1337	325	164	164
query27	2671	552	315	315
query28	4534	2034	2026	2026
query29	1107	620	497	497
query30	311	239	203	203
query31	1170	1065	963	963
query32	133	60	58	58
query33	507	315	247	247
query34	1217	1133	636	636
query35	760	783	675	675
query36	1389	1398	1248	1248
query37	150	103	84	84
query38	3218	3135	3047	3047
query39	939	923	917	917
query39_1	886	883	880	880
query40	215	119	97	97
query41	65	61	60	60
query42	94	93	94	93
query43	317	316	275	275
query44	
query45	189	182	180	180
query46	1080	1192	743	743
query47	2408	2375	2250	2250
query48	366	400	297	297
query49	629	475	354	354
query50	974	363	264	264
query51	4342	4276	4196	4196
query52	93	97	82	82
query53	244	273	187	187
query54	287	235	191	191
query55	80	77	69	69
query56	230	247	230	230
query57	1433	1393	1323	1323
query58	261	215	202	202
query59	1563	1674	1421	1421
query60	292	255	227	227
query61	154	148	146	146
query62	711	664	586	586
query63	237	198	181	181
query64	2594	799	614	614
query65	
query66	1804	454	351	351
query67	29869	29779	29553	29553
query68	
query69	419	292	259	259
query70	938	933	920	920
query71	301	222	215	215
query72	2940	2643	2376	2376
query73	857	724	410	410
query74	5173	4981	4768	4768
query75	2710	2586	2294	2294
query76	2323	1144	788	788
query77	368	369	295	295
query78	12343	12310	11814	11814
query79	1305	1026	798	798
query80	564	514	426	426
query81	455	285	251	251
query82	247	162	124	124
query83	277	277	264	264
query84	293	145	118	118
query85	918	604	526	526
query86	332	316	292	292
query87	3342	3371	3231	3231
query88	3670	2767	2759	2759
query89	430	425	331	331
query90	2187	195	181	181
query91	169	159	137	137
query92	66	60	60	60
query93	1557	1485	895	895
query94	528	351	308	308
query95	686	457	343	343
query96	1055	789	348	348
query97	2698	2676	2560	2560
query98	206	210	201	201
query99	1171	1163	1034	1034
Total cold run time: 250891 ms
Total hot run time: 168792 ms

@eldenmoon eldenmoon closed this Jun 8, 2026