[fix](ann-index) Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient rows.#64082
TPC-H: Total hot run time: 29269 ms |
TPC-DS: Total hot run time: 169468 ms |
### What problem does this PR solve?
Issue Number: None
Related PR: apache#64082
Problem Summary: Clarify why ANN index writer swaps the buffered vectors with an empty PODArray instead of using clear(). The swap intentionally releases the full-segment training buffer before saving the index, while clear() would keep the allocated capacity.
### Release note
None
### Check List (For Author)
- Test: No need to test (comment-only change)
- Behavior changed: No
- Does this need documentation: No
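The swap-versus-clear() distinction in the problem summary above can be illustrated with a minimal sketch. It assumes `std::vector<float>` as a stand-in for Doris's PODArray (the real writer does not use `std::vector`), and the helper names `capacity_after_clear` and `capacity_after_swap` are hypothetical. It also relies on a default-constructed `std::vector` holding no allocation, which is true of the mainstream standard libraries but not strictly required by the standard.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the writer's buffer; Doris uses PODArray, not std::vector.
using Buffer = std::vector<float>;

// clear() destroys the elements but leaves the allocated capacity in place.
inline std::size_t capacity_after_clear(Buffer buf) {
    buf.clear();
    return buf.capacity();
}

// Swapping with an empty buffer moves the allocation into a temporary,
// which frees it when the temporary goes out of scope.
inline std::size_t capacity_after_swap(Buffer buf) {
    Buffer empty;
    buf.swap(empty);
    return buf.capacity();
}
```

With a 1024-element buffer, the clear() path still reports a capacity of at least 1024, while the swap path reports the capacity of an empty buffer. That is why a swap is the usual choice when the memory itself has to be returned before a later, memory-hungry step.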
### What problem does this PR solve?
Issue Number: None
Related PR: apache#64082
Problem Summary: Remove the redundant ANN writer `_skip_build` state. The flag was only set from `close_on_error()`, while normal index skip behavior is already driven by zero rows or by the segment row count being smaller than the index training requirement. Keeping the writer state explicit avoids carrying an abort flag into regular add and finish paths.
### Release note
None
### Check List (For Author)
- Test: Unit Test
  - `ENABLE_PCH=OFF ./run-be-ut.sh --run --filter=AnnIndexWriterTest.*`
- Behavior changed: No
- Does this need documentation: No
TPC-H: Total hot run time: 29312 ms |
TPC-DS: Total hot run time: 169349 ms |
…added no-train indexes during segment writing. This made the build strategy harder to reason about and could still spend CPU/memory building small HNSW/FLAT segments that should be skipped by a Doris-side row threshold. This change removes the chunk add configs, buffers ANN vectors for the whole segment, applies effective_min_rows = max(vector_index->get_min_train_rows(), config::ann_index_build_min_segment_rows) in finish(), and then trains when needed, adds once, releases the build buffer, and saves the index. Empty segments or segments below the effective threshold delete only the current index entry instead of persisting an ANN index. Add BE config ann_index_build_min_segment_rows to skip persisting ANN indexes for small segments. Remove ann_index_build_add_chunk_size and ann_index_build_add_chunk_bytes.
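The finish-time threshold described above can be sketched as a small decision function. This is an illustrative sketch, not the PR's code: `BuildAction` and `decide_build` are hypothetical names, the two row thresholds are passed in as plain integers rather than read from `vector_index->get_min_train_rows()` and `config::ann_index_build_min_segment_rows`, and treating "below the threshold" as a strict less-than is an assumption.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical result type: either skip (delete the current index entry) or build.
enum class BuildAction { kSkip, kBuild };

// Hypothetical helper mirroring the rule stated in the commit message:
// effective_min_rows = max(min_train_rows, min_segment_rows).
inline BuildAction decide_build(int64_t total_rows, int64_t min_train_rows,
                                int64_t min_segment_rows) {
    const int64_t effective_min_rows = std::max(min_train_rows, min_segment_rows);
    // Empty segments and segments below the effective threshold are skipped.
    if (total_rows == 0 || total_rows < effective_min_rows) {
        return BuildAction::kSkip;
    }
    // Otherwise: train if needed, add once, release the buffer, save.
    return BuildAction::kBuild;
}
```

For example, a 300-row segment with a training requirement of 256 rows is still skipped when the Doris-side threshold is 1000, because the larger of the two limits applies.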
TPC-H: Total hot run time: 29312 ms |
TPC-DS: Total hot run time: 169025 ms |
_dir = compound_dir.value();
_min_segment_rows = AnnIndexColumnWriter::min_segment_rows();
What is this line of code doing?
Minimum segment rows required to persist an ANN index.
return Status::OK();
}
Status AnnIndexColumnWriter::_build_and_save(Int64 min_train_rows, Int64 effective_min_rows) {
Why does this function need the min_train_rows parameter?
### What problem does this PR solve?
Issue Number: None
Related PR: apache#64082
Problem Summary: The ANN writer buffers vectors through an internal helper after validating array dimensions in add_array_values(). Add a short comment to make the validation precondition explicit for the buffer helper path.
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: apache#64082
Problem Summary: The ANN writer used a tiny helper only to compute max(min_train_rows, ann_index_build_min_segment_rows). Inline the single-use calculation in finish() to keep the build threshold logic local and reduce unnecessary indirection.
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Ran git diff --check
  - Ran rg to verify _effective_min_rows has no remaining references
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: apache#64082
Problem Summary: The ANN writer had small single-use helpers and a cached min segment rows member after switching to finish-time buffering. Inline vector buffering, buffer release, and direct ann_index_build_min_segment_rows access at their call sites to keep the writer implementation simpler.
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Ran git diff --check
  - Ran rg to verify _append_vectors_to_buffer, _release_buffered_vectors, _min_segment_rows, and min_segment_rows() have no remaining references
- Behavior changed: No
- Does this need documentation: No
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TPC-H: Total hot run time: 29625 ms |
TPC-DS: Total hot run time: 170617 ms |
I found one blocking issue in the ANN writer change. The PR improves small-segment skip behavior and fixes the IVF_ON_DISK min-training-row calculation, but replacing the old chunked build path with an unbounded full-segment PODArray buffer can make ANN index build allocate rows * dim * sizeof(float) outside Doris memory tracking until finish().
Critical checkpoint conclusions:
- Goal and tests: the PR targets IVF/PQ recall, init-time reserve removal, IVF_ON_DISK min-train handling, and small segment skip behavior; tests cover these behaviors, but they do not cover high-dimensional/large-segment memory pressure introduced by the new buffering strategy.
- Scope/focus: the implementation is mostly focused, but the removal of chunked add/train changes the memory bound of the writer substantially.
- Concurrency: no new shared mutable state or lock ordering issue was found in the changed writer path; existing FAISS OpenMP budget remains used for train/add.
- Lifecycle/static initialization: no new cross-TU static initialization or ownership-cycle issue was found.
- Configuration: ann_index_build_min_segment_rows is mutable, validated non-negative, and read at finish(), so runtime changes can affect subsequent segment builds.
- Compatibility/storage format: skipping a segment deletes the current index directory entry rather than writing a new format; reader load failure falls back to brute-force paths, so no storage-format incompatibility was found.
- Parallel paths: normal ANN build and IVF_ON_DISK min-train paths were considered; no missing parallel code path was found beyond the memory-bound regression noted inline.
- Error handling/data correctness: Status returns in the changed writer path are propagated; missing/small indexes appear to downgrade to data scan instead of returning incomplete results.
- Performance/memory: blocking issue found: full-segment buffering is unbounded and not MemTracker-reserved.
- Observability: existing logs are sufficient for skipped segments; no additional blocking observability issue found.
User focus: no additional user-provided review focus was specified.
// The offsets check above guarantees every array row matches the ANN index dimension.
DCHECK(p != nullptr);
_buffered_vectors.insert(_buffered_vectors.end(), p, p + num_rows * dim);
_total_rows += cast_set<int64_t>(num_rows);
This changes the ANN writer from the old bounded chunk buffer (ann_index_build_add_chunk_size * dim) to retaining every vector in the segment until finish(). That makes the allocation segment_rows * dim * sizeof(float) in _buffered_vectors, and ANN dim is only validated as positive while segment splitting is not based on this buffer. For example, a high-dimensional ANN column can accumulate hundreds of MB or GB in this PODArray before FAISS train/add runs, and this allocation is not reserved against a Doris MemTracker. This reintroduces the OOM risk the PR is trying to avoid, just during load/append instead of init(). Please keep the training input bounded/tracked (or enforce a byte cap/reservation and fail cleanly) instead of unconditionally buffering the full segment.
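To make the size concern concrete, the sketch below computes `segment_rows * dim * sizeof(float)` with overflow protection and checks it against a byte cap, which is one of the mitigations the comment suggests. The function names `buffered_bytes` and `fits_byte_cap` and the cap parameter are hypothetical; the PR does not define them, and a real fix in Doris would more likely reserve against a MemTracker than compare against a fixed number.

```cpp
#include <cstdint>
#include <limits>

// Bytes needed to hold `rows` vectors of `dim` floats.
// Saturates at UINT64_MAX instead of wrapping on overflow.
inline uint64_t buffered_bytes(uint64_t rows, uint64_t dim) {
    const uint64_t kMax = std::numeric_limits<uint64_t>::max();
    if (rows == 0 || dim == 0) {
        return 0;
    }
    if (rows > kMax / dim) {
        return kMax;  // rows * dim would overflow
    }
    const uint64_t elems = rows * dim;
    if (elems > kMax / sizeof(float)) {
        return kMax;  // elems * sizeof(float) would overflow
    }
    return elems * sizeof(float);
}

// Hypothetical guard: true when the full-segment buffer stays within cap_bytes.
inline bool fits_byte_cap(uint64_t rows, uint64_t dim, uint64_t cap_bytes) {
    return buffered_bytes(rows, dim) <= cap_bytes;
}
```

As a worked example, with a 4-byte `float`, a segment of 1,000,000 rows at dimension 1024 needs 1,000,000 × 1024 × 4 = 4,096,000,000 bytes, about 4.1 GB, so it would fail a 1 GiB cap, while 1,000 rows at dimension 128 need only 512,000 bytes.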
…ld-buffer reservation, and skip ANN index build for segments with insufficient rows. (apache#64082)
…ld-buffer reservation, and skip ANN index build for segments with insufficient rows. (#64216)
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
### Release note
Cherry-pick #64082
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary:
This PR fixes several ANN index build issues:
- … `ann_index_build_chunk_size * dim` floats during init, which could allocate excessive memory immediately for high-dimensional vectors.
- … `nlist` as its minimum FAISS training row requirement.
This PR changes the build behavior as follows:
- … `ann_index_build_min_segment_rows` so small ANN indexes can be skipped by a Doris-side row threshold.
Release note
Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient training rows.
Check List (For Author)
Test
./run-regression-test.sh --run -d ann_index_p0 -s ivf_pq_full_buffer_train_recall
run buildall
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)