Add FSSTView encoding: a ListView-style FSST array #8191
FSSTView addresses its FSST-compressed codes with separate `offsets` and
`sizes` arrays (like ListView) instead of FSST's single monotonic offsets array
(like List/VarBin). Decoupling start from length means offsets need not be
monotonic or contiguous, so filter/take/slice become metadata-only: they
rewrite only the small offsets/sizes/lengths/validity arrays and reuse the
compressed byte heap and symbol table untouched.
This avoids the heap rewrite that plain FSST incurs on filter/take (which
delegate to VarBin), giving the same speed win ListView has over List.
- New `vortex.fsstview` encoding in the fsst crate, reusing FSSTData for the
symbol table + compressed byte heap. Children are declared with the
`#[array_slots(FSSTView)]` proc macro (uncompressed_lengths, codes_offsets,
codes_sizes, codes_validity).
- Metadata-only FilterKernel, TakeExecute, and SliceReduce.
- scalar_at decodes a single element via its offset+size slice.
- Canonicalization gathers the live codes (possibly out-of-order) and
bulk-decompresses into a VarBinView.
- `fsstview_from_fsst` zero-copy conversion from an FSST array.
- Registered in `register_default_encodings`.
- Tests: canonical/filter/take/slice equivalence vs FSST, scalar_at, and
filter/take/consistency conformance for nullable and non-nullable data.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
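The offsets+sizes addressing described above can be sketched in plain Rust. This is only an illustration under stated assumptions: `ViewSketch` and its methods are hypothetical names, the heap is an ordinary byte slice rather than FSST codes, and none of this is the crate's actual API. It shows that filter and take touch only the two small metadata vectors while the heap is borrowed unchanged.

```rust
/// Hypothetical stand-in for an FSSTView-style layout: per-element
/// (offset, size) pairs that address a shared, untouched byte heap.
#[derive(Debug, Clone, PartialEq)]
struct ViewSketch {
    offsets: Vec<u32>,
    sizes: Vec<u32>,
}

impl ViewSketch {
    /// Build a view from List/VarBin-style monotonic offsets (len + 1 entries).
    fn from_monotonic(offsets: &[u32]) -> Self {
        assert!(!offsets.is_empty(), "offsets must hold len + 1 entries");
        ViewSketch {
            offsets: offsets[..offsets.len() - 1].to_vec(),
            sizes: offsets.windows(2).map(|w| w[1] - w[0]).collect(),
        }
    }

    /// Filter: keep elements whose mask entry is true. Only metadata is rewritten.
    fn filter(&self, mask: &[bool]) -> Self {
        assert_eq!(mask.len(), self.offsets.len());
        let mut out = ViewSketch { offsets: Vec::new(), sizes: Vec::new() };
        for (i, &keep) in mask.iter().enumerate() {
            if keep {
                out.offsets.push(self.offsets[i]);
                out.sizes.push(self.sizes[i]);
            }
        }
        out
    }

    /// Take: indices may repeat or be out of order, so the resulting
    /// offsets are in general neither monotonic nor contiguous.
    fn take(&self, indices: &[usize]) -> Self {
        ViewSketch {
            offsets: indices.iter().map(|&i| self.offsets[i]).collect(),
            sizes: indices.iter().map(|&i| self.sizes[i]).collect(),
        }
    }

    /// Borrow element `i` from the heap; the heap itself is never rewritten.
    fn bytes<'a>(&self, heap: &'a [u8], i: usize) -> &'a [u8] {
        let start = self.offsets[i] as usize;
        &heap[start..start + self.sizes[i] as usize]
    }
}
```

A take with indices `[3, 0, 3]` yields offsets that jump backwards and repeat, which a single monotonic offsets array could not represent without rewriting the heap.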
…bench
Adds the second hop and the canonicalization decision for the FSSTView
pipeline, plus a benchmark that measures the trade-off directly.
- `fsst_filter_to_view` / `fsst_take_to_view`: reinterpret an FSSTArray as an
FSSTView (sharing symbols + codes bytes) and apply the metadata-only kernel,
so filtering/taking an FSSTArray never rewrites the compressed byte heap.
- Canonicalization now chooses a compaction strategy (FsstViewCompaction):
  - Direct: live codes still contiguous/in-order (untouched or sliced view) ->
    one bulk decompress, no copy.
  - GatherBulk ("compact"): copy the scattered live codes contiguous, then one
    bulk decompress. Wins when strings are short/numerous (per-call overhead
    dominates otherwise; the gather is cheap and unlocks bulk SIMD).
  - PerElement ("no compact"): decompress each element's slice in place, no
    copy. Wins when strings are long/few (the gather copy dominates).
  Auto picks Direct when contiguous, else GatherBulk/PerElement by average
  compressed bytes/element. `canonicalize_fsstview_with` exposes each strategy
  for benchmarking.
- benches/fsst_view_compute.rs: calls kernels directly (no dispatch) and
measures each part. filter (selective/non-selective), take (shuffle /
selective / dense), and a filter+take combo, over two ~2 MiB inputs (many
short strings, fewer long strings). fsst pipeline compacts into a fresh
FSSTArray each step then canonicalizes; fsstview pipeline stays metadata-only
then canonicalizes under each compaction strategy.
- Tests: from_fsst helpers vs canonical, and all compaction strategies agree
on both contiguous and scattered views.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The fsst_view_compute benchmark (two ~2 MiB inputs, ~12-byte and ~256-byte
strings) shows GatherBulk beats PerElement across the entire tested range, not
just for short strings as originally guessed. FSST's decoder has a fast 8-wide
body and a slow byte-by-byte tail; PerElement pays that tail once per element
while GatherBulk pays it once for the whole heap, which dominates the gather
memcpy even at 256-byte strings.
Selected medians (canonicalize after the metadata-only hop):
  take few_long/shuffle: gather 459us vs per_element 623us
  take few_long/dense: gather 838us vs per_element 981us
  filter many_short/nonsel: gather 5.38ms vs per_element 5.92ms
And the metadata-only hop itself is far cheaper than compacting FSST:
  take_step many_short/shuffle: view 650us vs fsst 2.84ms (~4x)
  take_step many_short/dense: view 604us vs fsst 4.15ms (~7x)
So Auto now picks Direct when the live codes are contiguous and GatherBulk
otherwise; it never selects PerElement (kept selectable for measurement, wins
only in the few-very-long-strings extreme outside real columns). Drops the
SHORT_STRING_THRESHOLD heuristic and updates the docs to the measured behavior.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
- fsstview_from_fsst now reuses the FSST offsets buffer for codes_offsets via a
zero-copy slice of its first `len` elements, instead of re-copying into a new
Vec. Only the derived sizes array is freshly allocated.
- Add chain_pipeline_{fsst,view} benches: a 5-op alternating filter/take chain
ending in a canonicalize. This is where the view model is meant to win — each
fsst op re-compacts the byte heap (cost compounds with chain length), while
the view converts once and chains metadata-only ops, deferring the single
gather+decode to the end.
Measured medians (100 samples):
FewLong: fsst 765us -> view 481us (1.6x)
ManyShort: fsst 14.49ms -> view 9.64ms (1.5x)
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
… finding
Implements the "compact like a list / export paired slices into a VarBinView"
idea: decode contiguous heap runs straight into a heap-ordered buffer and point
VarBinView views back into it out of order, with no gather copy and duplicate
dedup. Wired as FsstViewCompaction::RunCoalesce, hash-free (sort-based),
handles nulls/empties/duplicates; covered by an adversarial
gaps+shuffle+nullable test and the all-strategies-agree test.
Benchmark verdict: it loses to GatherBulk everywhere, badly for short strings
(take many_short/shuffle ~18ms vs ~5.6ms). The random access you avoid at
decode time reappears at view-build time: views are built in element order over
a heap-ordered output, so make_view does N cache-missing random reads (and
random inlining copies for <=12-byte strings), plus an O(N log N) sort.
GatherBulk's output is element-ordered, so its view-build is sequential; the
cheap sequential gather memcpy beats the scattered view construction.
So Auto keeps using Direct (contiguous) / GatherBulk (otherwise) and never
picks RunCoalesce; it's retained as a selectable, measurable baseline. Docs
updated with the full reasoning.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…nches
Callgrind on a shuffle take showed the kernel's cost was dominated by running
take -> fill_null -> cast -> optimize three times (offsets/sizes/lengths). The
fill_null + cast is only needed when a null index could introduce a null, i.e.
when the indices are nullable. For non-nullable indices (the common case) the
children stay non-nullable, so we now skip fill_null entirely. Re-profiling
confirms fill_null (~450K ir) and its cast (~252K ir) drop out and the take
kernel falls from ~612K to ~474K instructions per call.
Also add take_op_only_view / filter_op_only_view benches that hoist the
one-time FSST->view conversion out of the timed loop, isolating the
metadata-only op. These show the op is constant-time regardless of size or
selectivity (~457 ns filter, ~657 ns take), like a ListView op; the earlier
"view loses on selective" was purely the O(n) conversion being charged to every
op, which only the first op of a chain actually pays.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ge idea
Adds FsstViewByteStats / fsstview_byte_stats reporting, in both compressed (code)
and uncompressed (decoded) space: live vs run-spanned vs whole-heap bytes,
distinct spans, run count, and the dead-byte waste a gap-merged decode would
carry. A byte_stats_report test prints it for a selective filter and a shuffle
take (run with --nocapture).
This quantifies why merging across gaps to keep decode runs long doesn't pay:
filter_10pct (keep ~10% of 65536):
runs=5945 over 6616 survivors (avg ~1.1 elem/run -> survivors are isolated)
compressed: live=25.8KB, heap=255KB
full-heap-merge waste = 89.9% (you'd decode ~10x the needed compressed bytes)
shuffle_take (reorder all):
runs=1, waste=0% (RunCoalesce's ideal) -- yet it still loses on time to
GatherBulk because the random access just moves to view-build.
So the dead-value budget the gap-merge idea needs is blown immediately on a
selective filter (90% dead), and on the one input where merging is free
(shuffle, 0% dead) GatherBulk still wins. There's also a hard blocker: after a
filter the dead elements' uncompressed_lengths are gone and FSST decode only
returns a total written count, so a single gap-merged decode can't even locate
post-gap survivors. Conclusion: GatherBulk (zero waste) / Direct (contiguous)
remain the right canonicalization; the stats make the trade-off measurable.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The only overhead GatherBulk carries over the theoretical minimum is the gather
memcpy, and it was copying every element's span individually. For an
order-preserving filter, surviving neighbours are still heap-adjacent, so a run
of k survivors can be copied in one memcpy instead of k.
The gather now accumulates a contiguous [run_start, run_end) heap range and
flushes it once per run, making the copy cost proportional to the number of
runs rather than the number of elements. This is a strict win where survivors
form long runs (non-selective filter: many_short/nonselective canonicalize
~5.38ms -> ~4.75ms) and a no-op for a shuffle (no adjacency -> one copy per
element as before, behind a cheap branch).
Combined with Direct (single contiguous run, zero copy), the export is now
optimal: gather work scales with run count, then one bulk decode, then a
sequential element-ordered view-build.
Correctness: spans are still emitted in element order, so the decoded buffer
stays element-ordered; coalescing only fires on genuine zero-gap adjacency.
Covered by the existing all-strategies-agree and gaps+shuffle+nullable tests.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
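The run-coalescing gather described above can be sketched as follows. This is a minimal illustration, not the PR's code: `gather_coalesced` is a hypothetical name, offsets and sizes are plain `usize` slices, and the returned copy count exists only to make the per-run cost visible.

```rust
/// Copy the live spans of `heap` into one contiguous buffer in element order.
/// Heap-adjacent neighbours are merged into a pending [start, end) run and
/// flushed with a single copy. Returns the bytes and the number of copies.
fn gather_coalesced(heap: &[u8], offsets: &[usize], sizes: &[usize]) -> (Vec<u8>, usize) {
    assert_eq!(offsets.len(), sizes.len());
    let total: usize = sizes.iter().sum();
    let mut out = Vec::with_capacity(total);
    let mut copies = 0;
    // Pending run as a half-open heap range, if any.
    let mut run: Option<(usize, usize)> = None;
    for (&off, &size) in offsets.iter().zip(sizes) {
        run = match run {
            // Zero-gap adjacency: extend the pending run instead of copying.
            Some((start, end)) if end == off => Some((start, end + size)),
            // Gap or reorder: flush the pending run, then start a new one.
            Some((start, end)) => {
                out.extend_from_slice(&heap[start..end]);
                copies += 1;
                Some((off, off + size))
            }
            None => Some((off, off + size)),
        };
    }
    // Flush the final run.
    if let Some((start, end)) = run {
        out.extend_from_slice(&heap[start..end]);
        copies += 1;
    }
    (out, copies)
}
```

Because spans are visited in element order and only merged when the next span starts exactly where the pending run ends, the output stays element-ordered; a shuffle with no adjacent neighbours degenerates to one copy per element.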
Add canonicalize_fsstview_to_varbin: reuses the element-ordered decode path of
the VarBinView canonicalizer (Direct/GatherBulk/PerElement), but the finisher
builds len+1 cumulative offsets over the contiguous decoded bytes instead of a
16-byte view per element. Covered by varbin_export_matches_canonical across all
element-ordered strategies on gapped nullable data.
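As a rough sketch of the VarBin finisher described above: once the decoded bytes are contiguous and element-ordered, the only per-element work is a running sum of the uncompressed lengths. The function name and the `u32`/`u64` widths are assumptions for illustration, not the PR's types.

```rust
/// Build `len + 1` cumulative offsets from per-element uncompressed lengths.
/// Element `i` then occupies `offsets[i]..offsets[i + 1]` in the decoded bytes.
fn cumulative_offsets(lengths: &[u32]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(lengths.len() + 1);
    let mut acc = 0u64;
    offsets.push(acc);
    for &len in lengths {
        acc += u64::from(len);
        offsets.push(acc);
    }
    offsets
}
```

Compared with a VarBinView finisher, nothing here reads the decoded bytes at all, which is consistent with the export being cheaper when the consumer accepts offsets+bytes.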
Add export_{fsst,view}_to_{varbin,varbinview} benches: a single filter then
export, the full {fsst, fsstview} x {VarBin, VarBinView} matrix.
Medians (100 samples), single filter + export:
many_short (174k x ~12B):  fsst->VBV  fsst->VB  view->VBV  view->VB
  nonselective 90%         3.11ms     2.63ms    4.75ms     2.64ms
  selective 10%            561us      534us     1.02ms     728us
few_long (8k x ~256B):
  nonselective 90%         467us      328us     472us      303us
  selective 10%            73us       55us      77us       72us
Takeaways:
- VarBin export is consistently cheaper than VarBinView for both encodings: no
per-element 16-byte view construction, just an offsets cumsum. Biggest gap on
many short strings (view->VB 2.64ms vs view->VBV 4.75ms, ~1.8x).
- For a single filter, fsst is competitive with or ahead of fsstview: fsst pays
the heap rewrite once during the (cheap-when-selective) filter, while fsstview
pays a gather at export plus the one-time FSST->view conversion. The view's
advantage is in chains, where the per-op heap rewrite is amortized away (see
the chain_pipeline benches).
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds convert_varbin_to_varbinview: takes the VarBin produced by the view->VarBin
export and times converting it into a VarBinView, isolated (conversion only).
Medians (100 samples):
many_short (174k x ~12B): nonsel 2.29ms selective 203us
few_long (8k x ~256B): nonsel 197us selective 12us
This answers "is decode-to-VarBin-then-convert cheaper than decode-straight-to-
VarBinView?" when the consumer wants a view:
many_short/nonsel, to reach a VarBinView:
view->VarBin (3.42ms) + convert (2.29ms) = 5.71ms
view->VarBinView direct = 4.92ms
Going via VarBin is worse (5.71ms vs 4.92ms): the conversion adds back the
per-element view construction you skipped, plus a full re-decode/copy of the
bytes. So VarBin export is only the right target when the consumer actually
wants offsets+bytes (VarBin); if the result must be a VarBinView, decode
straight to it.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…lace")
Two parts, driven by what realistic query shapes revealed.
Benches: real query masks are rarely uniform-random. Add Selectivity shapes
(uniform / range-scan / clustered bursts) and a sorted-index "index lookup"
take, plus a canon_only bench that isolates the export decode by strategy.
These showed that *run length* (set by the selection shape), not raw
selectivity, drives the view's export cost, and that uniform-random was the
view's worst case all along.
Optimization: callgrind showed fsstview_from_fsst is ~21% of a single-op
filter->VarBinView (it derives sizes for all n elements before the filter
discards most). Rather than fuse convert+filter (which would break composition
across a chain of filters/takes), the new RunDecode export strategy attacks the
gather instead: when survivors are monotonic (after any filter / sorted take /
slice), decode each contiguous heap run directly into the element-ordered
output with NO gather copy. Output stays element-ordered so the view-build is
sequential (unlike RunCoalesce).
Auto now chooses "export all in place" (RunDecode) vs "compact codes then
export" (GatherBulk) by run count: runs <= len/4 and monotonic -> RunDecode,
else GatherBulk.
canon_only medians (export decode only):
  many_short clustered: RunDecode 313us vs GatherBulk 333us (Auto -> 313us)
  many_short range: RunDecode 345us vs GatherBulk 370us (Auto -> 333us)
  many_short uniform: GatherBulk 561us vs RunDecode 657us (Auto -> 563us)
Auto picks the winner on every shape. The conversion and metadata-only
filter/take stay separate, so chains still compose; only the final canonicalize
compacts or not.
Covered by run_decode_monotonic_filter (nulls/empties/multi-run/trailing-run)
and the existing all-strategies-agree tests; 111 tests pass.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
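The Auto rule stated above (Direct when contiguous, RunDecode when monotonic with runs <= len/4, GatherBulk otherwise) might look roughly like this. The `Strategy` enum, the function name, and the exact definitions of "run" and "monotonic" used here are assumptions made for the sketch; the PR's implementation may count them differently.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    Direct,
    RunDecode,
    GatherBulk,
}

/// Pick an export strategy from the addressing metadata alone.
/// Assumed definitions: a new run starts whenever an element does not begin
/// exactly where the previous one ended; the view is monotonic when no
/// element begins before the previous one ended.
fn choose_strategy(offsets: &[usize], sizes: &[usize]) -> Strategy {
    assert_eq!(offsets.len(), sizes.len());
    let len = offsets.len();
    if len == 0 {
        return Strategy::Direct;
    }
    let mut runs = 1;
    let mut monotonic = true;
    for i in 1..len {
        let prev_end = offsets[i - 1] + sizes[i - 1];
        if offsets[i] < prev_end {
            monotonic = false;
        }
        if offsets[i] != prev_end {
            runs += 1;
        }
    }
    if runs == 1 {
        // Contiguous and in order: one bulk decode, no copy.
        Strategy::Direct
    } else if monotonic && runs <= len / 4 {
        // Few long runs: decode each run in place.
        Strategy::RunDecode
    } else {
        // Scattered or reordered: compact the codes first.
        Strategy::GatherBulk
    }
}
```

Under these definitions a clustered filter (few long runs) lands on RunDecode, while a uniform-random filter is monotonic but has nearly one run per survivor and so falls through to GatherBulk, matching the shape of the medians above.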
Callgrind on fsstview_from_fsst in isolation showed BufferMut::push was 68% of
the conversion: deriving the sizes array pushed element-by-element, and push
does a reserve(1) capacity recheck every iteration even though capacity is
reserved up front. Switching the size-derivation loop to push_unchecked over
offsets.windows(2) (capacity guaranteed) lets it vectorize.
Conversion instruction count: 20.7B -> 1.73B over 3000 iters (~12x).
End-to-end single filter -> VarBinView (many_short, where conversion was ~21%):
  clustered 868us -> 406us (2.1x)
  range 885us -> 417us (2.1x)
  uniform 1108us -> 689us (1.6x)
This flips the single-op verdict: the view now beats fsst on clustered and
range filters (the realistic DB selection shapes), losing only on the
adversarial uniform-random case. Combined with the RunDecode export heuristic,
the view is now competitive even for a single op, not just chains.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…sum)
Profiling the full single-op pipeline (post conversion-fix) showed
canonicalize_fsstview_with self-cost at ~19%, dominated by materializing three
Vec<usize> (offsets, sizes, ulens) and the VarBin exporter's per-element offset
push.
- total_size is now summed straight from the typed uncompressed-lengths slice;
the widened ulens: Vec<usize> is built lazily (widen_ulens) only by the
run/per-element decoders. Direct and GatherBulk no longer allocate it.
- The VarBin exporter builds its len+1 cumulative offsets directly from the
typed slice with push_unchecked (capacity reserved), instead of widening to
Vec<usize> and pushing element-by-element.
Net: VarBin export many_short/selective ~726us -> ~467us; the canon decode
itself is unchanged (RunDecode already avoided the ulens Vec). All strategies
still agree; 111 tests pass.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds fsst_view_fineweb, which benchmarks FSST vs FSSTView on two real columns
from the HuggingFace FineWeb 10BT sample, instead of synthetic data. The sample
is ~2GB so it isn't downloaded by the bench; it reads length-prefixed dumps of
the `url` (~72B avg, 200k rows) and `text` (~3KB avg, 40k rows) columns produced
once with DuckDB (recipe in the module docs). The bench no-ops if the
FINEWEB_URL / FINEWEB_TEXT env vars are unset, so CI stays green.
Two workloads, fsst (rewrite heap per op) vs fsstview (metadata-only per op):
single filter -> VarBinView, and a 5-op filter/take chain -> VarBinView.
Real-data medians:
                     fsst     fsstview  speedup
single_filter url    1.02ms   0.84ms    1.2x
single_filter text   5.81ms   4.38ms    1.3x
chain url            6.23ms   3.95ms    1.6x
chain text           44.2ms   5.16ms    8.6x
The view wins every real case, decisively on chained ops over long strings
(text chain 8.6x): fsst rewrites the large code heap on every op while the view
stays metadata-only and decodes once at the end. On real (longer) strings the
FSST->view conversion is no longer a notable cost — the earlier synthetic
"conversion dominates" finding was an artifact of 12-byte strings plus
per-op-in-a-loop measurement; the design win (metadata-only chaining) is what
actually pays off.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Removes exploratory scaffolding now that the design has settled, keeping only
what production uses:
- Compaction strategies: drop PerElement and RunCoalesce (both proven to lose;
Auto never picked them) and the FsstViewByteStats diagnostics.
FsstViewCompaction is now Auto / Direct / GatherBulk / RunDecode.
- canonical.rs: factor the shared element-ordered decode into
decode_element_ordered, reused by the VarBinView and VarBin finishers; ~600 ->
~430 lines.
- Synthetic bench (fsst_view_compute): replace the 945-line exploration matrix
with a focused single-filter + 5-op-chain comparison over two shapes,
mirroring the FineWeb bench. Real-data benchmarking lives in
fsst_view_fineweb.
- Tests: drop the removed-strategy cases and the byte-stats report; keep the
all-strategies-agree, gapped-filter, RunDecode-monotonic, VarBin-export, and
conformance coverage.
Net -1361/+195 lines. 107 tests pass, clippy clean, fmt clean, vortex-file
builds.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds fsst_view_fineweb_queries, which materializes a string column under the
actual WHERE predicates from the vortex-bench FineWeb queries (dump = ..., date
LIKE '2020-10-%', url/text LIKE '%google%', '%espn%', '% vortex %', ...). Each
predicate is evaluated once in DuckDB against the real HuggingFace 10BT sample to
produce an authentic per-row selection mask (recipe in fineweb_queries_extract.py);
the bench applies that mask to the FSST-compressed url/text column and decodes to
a VarBinViewArray, fsst vs fsstview. No-ops if FINEWEB_DIR is unset.
The real masks span the spectrum: clustered selections (dump_eq 7%/177 runs,
date_prefix 12%) vs scattered LIKE-containment (google_or 2%/4046 runs) vs tiny
(vortex 0.04%, espn ~0.08%).
Real-query medians:
                   fsst     view
text/date_prefix   63.4ms   43.9ms   view 1.4x
text/dump_eq       40.9ms   26.0ms   view 1.6x
text/google_or     26.8ms   21.4ms   view 1.25x
url/dump_eq        1.13ms   0.94ms   view 1.2x
url/google_and     30us     164us    fsst (tiny, very selective)
url/vortex         8us      140us    fsst (tiny)
Two regimes: on bulk-ish selections over the long text column the view wins
(1.3-1.6x) by skipping fsst's per-op heap rewrite; on highly selective predicates
over the short url column fsst wins because its filter rewrites an almost-empty
heap while the view pays a fixed ~130us floor converting all 200k offsets before
filtering discards >99% of them. Both are sub-millisecond there, but it confirms
on real query masks that converting the whole column ahead of a very selective
predicate is the view's one real weakness.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Documents what each benchmark measures (fsst_view_compute synthetic shapes,
fsst_view_fineweb real columns, fsst_view_fineweb_queries real query
predicates), the workloads, and the headline median results, plus how Auto
picks the decode strategy.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
FSSTVIEW_HANDOVER.md summarizes the encoding, the three benchmarks and their
median results, the Auto decode strategy, the known conversion-floor
limitation, and the profiling methodology. FSSTVIEW_NEXT_PROMPT.md is a
copy-paste prompt to continue the work (eliminate the selective-filter
conversion floor).
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Replace the per-element `codes_sizes` child with `codes_ends` (the end offset,
`offset + size`). A freshly converted FSST heap is contiguous, so element `i`
occupies `offsets[i]..offsets[i + 1]` and both addressing arrays are now
zero-copy slices of the FSST's existing monotonic offsets buffer
(`codes_offsets = offsets[0..len]`, `codes_ends = offsets[1..len + 1]`).
`fsstview_from_fsst` therefore allocates and copies nothing and no longer
materializes a per-row `sizes` array, so a selective `filter`/`take` that keeps
a handful of rows never pays an O(rows) cost to derive sizes for the rows it
discards. The per-element size is recovered as `codes_ends[i] -
codes_offsets[i]` only at canonicalize / `scalar_at`, over the survivors only.
`filter`/`take`/`slice` stay metadata-only and compose across a chain (they
carry `codes_ends` alongside `codes_offsets`); the conversion is not fused into
the filter.
Same-machine before/after on the real `fsst_view_fineweb_queries` bench (divan
medians):
  `url/vortex` 140 us -> 9.1 us
  `url/espn_and` 146 us -> 14.9 us
  `text/espn_and` 407 us -> 271 us (flips to a view win)
while the previously winning clustered cases hold (`text/dump_eq` 25.3 ms,
1.68x; `text/date_prefix` 41.4 ms, 1.67x). The view now wins or ties every
query in the matrix.
107 tests pass; clippy --all-targets --all-features clean; cargo +nightly fmt
clean; vortex-file builds; doc tests pass. README and handover updated.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ench
Hot-path cleanups on top of the codes_ends representation:
- canonicalize: derive each survivor's size in place from the widened `ends`
buffer instead of allocating a third index Vec, and sum `live` (total
compressed bytes) only on the bulk-decode paths that use it rather than
unconditionally up front (RunDecode never needs it).
- fsstview_from_fsst: construct via `new_unchecked`. Every FSSTView invariant
is already guaranteed by the source FSSTArray, so re-running
`validate_fsstview` on the hot conversion path is wasted work.
Trim the test/bench surface for merge:
- Drop the `fsst_view_fineweb` bench: its multi-op chain is already covered
synthetically by `fsst_view_compute`, and its column materialization overlaps
`fsst_view_fineweb_queries`.
- Remove the filter/take/slice `*_matches_canonical` smoke tests; the framework
conformance tests and the strategy-agreement tests already cover those paths.
104 tests pass; clippy --all-targets --all-features clean; cargo +nightly fmt
clean; vortex-file builds; doc tests pass. README and handover updated.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The conversion-floor fix rests on codes_offsets/codes_ends being zero-copy
slices of the FSST's single monotonic offsets buffer (offsets[0..len] and
offsets[1..len+1]): no copy, no per-element sizes array. Nothing tested that
invariant directly: the value/agreement tests would still pass if the
conversion were reverted to materialize sizes (silently reintroducing the
floor), and the bench that measures the floor is gated out of CI.
Add conversion_shares_offsets_buffer_zero_copy, which asserts structurally that
a freshly converted view's codes_ends slice begins exactly one element past
codes_offsets in the same allocation. Deterministic, no timing.
105 tests pass; clippy --all-targets --all-features clean; fmt clean.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
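The zero-copy invariant above can be illustrated with ordinary slices. This is a sketch under assumptions: the function names are hypothetical, `u32` offsets stand in for the typed buffer, and plain slice pointers stand in for whatever buffer handles the real test inspects.

```rust
/// Split FSST-style monotonic offsets (len + 1 entries) into start and end
/// views that borrow the same buffer; nothing is copied or allocated.
fn split_offsets(offsets: &[u32]) -> (&[u32], &[u32]) {
    assert!(!offsets.is_empty(), "offsets must hold len + 1 entries");
    let len = offsets.len() - 1;
    (&offsets[..len], &offsets[1..])
}

/// Structural check: `ends` has the same length as `starts` and begins
/// exactly one element past it.
fn ends_shifted_by_one(starts: &[u32], ends: &[u32]) -> bool {
    starts.len() == ends.len() && ends.as_ptr() == starts.as_ptr().wrapping_add(1)
}

/// Size of element `i`, recovered on demand rather than stored per row.
fn size_of(starts: &[u32], ends: &[u32], i: usize) -> u32 {
    ends[i] - starts[i]
}
```

A check of this shape fails as soon as either array is materialized into a fresh allocation, which is the regression a value-only test would miss.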
RunDecode walked the uncompressed-lengths array into a Vec<usize> solely to
advance its output cursor by each run's total uncompressed length. But
Decompressor::decompress_into already returns the exact decoded byte count for
the bytes it just wrote (the same value Direct uses for set_len), and that
count equals the run's uncompressed length. Advance out_pos by the return value
instead.
This removes one O(survivors) allocation on the clustered/range path (the
text/dump_eq, text/date_prefix wins) and the per-element run_uncompressed
accumulation, and deletes the now-unused widen_ulens helper. A
debug_assert_eq!(out_pos, total_size) documents and checks the cursor
invariant. The inter-run 7-byte decode slack behaviour is unchanged: each run
still decodes exactly its own compressed input.
105 tests pass (incl. the RunDecode-exercising gaps/monotonic-filter tests);
clippy --all-targets --all-features clean; fmt clean.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The three FSSTViewMetadata::get_*_ptype getters were each used once, in
deserialize, and only wrapped PType::try_from with a custom error message. The
sibling FSST encoding's own deserialize already inlines
PType::try_from(metadata.x)? directly (its TryFrom error converts to
VortexError via ?), so match that: inline the three calls and delete the getter
impl block. This also drops the now-unused vortex_err import.
Also refresh a comment in canonical.rs that still referenced the ulens:
Vec<usize> precompute removed in the previous commit.
105 tests pass; clippy --all-targets --all-features clean; fmt clean.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Benchmarks: PolarSignals ProfilingVortex (geomean): 0.976x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (0.976x ➖, 1↑ 0↓)
No file size changes detected. |
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.066x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.063x ➖, 0↑ 1↓)
datafusion / parquet (1.068x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.032x ➖, 1↑ 2↓)
duckdb / vortex-compact (1.052x ➖, 0↑ 1↓)
duckdb / parquet (1.062x ➖, 0↑ 2↓)
No file size changes detected. Full attributed analysis
|
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.050x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.055x ➖, 0↑ 1↓)
datafusion / parquet (1.052x ➖, 0↑ 1↓)
datafusion / arrow (1.060x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.040x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.040x ➖, 0↑ 0↓)
duckdb / parquet (1.043x ➖, 1↑ 3↓)
duckdb / duckdb (1.047x ➖, 0↑ 0↓)
No file size changes detected. Full attributed analysis
|
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.007x ➖, 1↑ 2↓)
datafusion / vortex-compact (1.005x ➖, 1↑ 1↓)
datafusion / parquet (1.004x ➖, 1↑ 2↓)
duckdb / vortex-file-compressed (1.001x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.000x ➖, 1↑ 0↓)
duckdb / parquet (1.000x ➖, 1↑ 0↓)
duckdb / duckdb (1.007x ➖, 1↑ 1↓)
No file size changes detected. Full attributed analysis
|
Benchmarks: FineWeb S3Verdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.033x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.996x ➖, 0↑ 0↓)
datafusion / parquet (1.019x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.950x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.977x ➖, 0↑ 0↓)
duckdb / parquet (0.946x ➖, 0↑ 0↓)
Full attributed analysis
|
Merging this PR will degrade performance by 11.37%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | baseline_lt[16, 65536] | 217.4 µs | 245.2 µs | -11.37% |
| 🆕 | Simulation | chain_fsst[FewLong] | N/A | 2 ms | N/A |
| 🆕 | Simulation | chain_view[FewLong] | N/A | 893.1 µs | N/A |
| 🆕 | Simulation | single_filter_fsst[ManyShort] | N/A | 2.2 ms | N/A |
| 🆕 | Simulation | single_filter_view[ManyShort] | N/A | 1.6 ms | N/A |
| 🆕 | Simulation | chain_fsst[ManyShort] | N/A | 17.8 ms | N/A |
| 🆕 | Simulation | chain_view[ManyShort] | N/A | 13.7 ms | N/A |
| 🆕 | Simulation | single_filter_fsst[FewLong] | N/A | 421.2 µs | N/A |
| 🆕 | Simulation | single_filter_view[FewLong] | N/A | 322.1 µs | N/A |

Comparing claude/fsstview-conversion-floor-kRAeg (d1418cf) with develop (f7294db)
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (1.031x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.030x ➖, 0↑ 0↓)
duckdb / parquet (1.046x ➖, 0↑ 0↓)
No file size changes detected. Full attributed analysis
|
Benchmarks: Random AccessVortex (geomean): 0.947x ➖ How to read Verdict and Engines
unknown / unknown (1.000x ➖, 6↑ 2↓)
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.983x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.974x ➖, 0↑ 0↓)
datafusion / parquet (0.964x ➖, 0↑ 0↓)
datafusion / arrow (0.967x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.001x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.000x ➖, 0↑ 1↓)
duckdb / parquet (0.964x ➖, 4↑ 0↓)
duckdb / duckdb (0.995x ➖, 0↑ 0↓)
No file size changes detected. Full attributed analysis
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.991x ➖, 1↑ 0↓)
datafusion / parquet (1.009x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.973x ➖, 5↑ 0↓)
duckdb / parquet (1.001x ➖, 0↑ 1↓)
duckdb / duckdb (0.965x ➖, 4↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.123x ➖, 0↑ 4↓)
datafusion / vortex-compact (0.929x ➖, 2↑ 0↓)
datafusion / parquet (1.024x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.905x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.902x ➖, 0↑ 0↓)
duckdb / parquet (0.966x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Appian on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.912x ➖, 3↑ 0↓)
datafusion / parquet (0.869x ✅, 4↑ 0↓)
duckdb / vortex-file-compressed (1.042x ➖, 0↑ 0↓)
duckdb / parquet (1.040x ➖, 0↑ 0↓)
duckdb / duckdb (1.037x ➖, 0↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: CompressionVortex (geomean): 1.000x ➖ How to read Verdict and Engines
unknown / unknown (0.999x ➖, 1↑ 1↓)
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.875x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.953x ➖, 0↑ 0↓)
datafusion / parquet (0.908x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.878x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.941x ➖, 0↑ 0↓)
duckdb / parquet (0.882x ➖, 0↑ 0↓)
Full attributed analysis
|