What happened?
The compact compression strategy currently only considers UTF-8 data types for Zstd: https://github.com/vortex-data/vortex/blob/0.73.0/vortex-btrblocks/src/schemes/string.rs#L238-L240
Zstd operates on raw bytes, so it should work on any binary data as well.
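To illustrate the point outside of Vortex, here is a minimal sketch showing that a general-purpose byte compressor does not care whether its input is valid UTF-8. It uses Python's standard-library zlib as a stand-in for Zstd (an assumption made so the snippet needs no extra packages). The helper name compressed_ratio and the sample payloads are hypothetical and not part of Vortex.

```python
import zlib


def compressed_ratio(payload: bytes) -> float:
    """Return compressed size divided by raw size for a byte payload."""
    return len(zlib.compress(payload)) / len(payload)


n = 10_000

# Same shape as the reproduction below: one 64-byte value repeated n times.
utf8_payload = ("abcd" * 16).encode() * n        # valid UTF-8
binary_payload = (b"\xff\xfe\x00\x80" * 16) * n  # 0xff never occurs in valid UTF-8

utf8_ratio = compressed_ratio(utf8_payload)
binary_ratio = compressed_ratio(binary_payload)

print(f"utf8   ratio: {utf8_ratio:.4f}")
print(f"binary ratio: {binary_ratio:.4f}")
```

Both payloads should shrink to a small fraction of their raw size, which is the behaviour I would expect from the compact strategy on binary columns too.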
Steps to reproduce
import os
import pyarrow as pa
import vortex as vx
n = 1_000_000
value = "abcd" * 16 # 64 bytes, valid UTF-8, trivially compressible
raw_bytes = n * len(value.encode())
for name, arr in [
    ("utf8", pa.array([value] * n, pa.string())),
    ("binary", pa.array([value.encode()] * n, pa.binary())),
]:
    path = f"{name}.vortex"
    vx.io.VortexWriteOptions.compact().write(pa.table({"col": arr}), path)
    size = os.path.getsize(path)
    print(f"{name:6s} compact: {size:>12,} bytes (raw {raw_bytes:,}, ratio {size / raw_bytes:.4f})")
gives:
utf8 compact: 16,060 bytes (raw 64,000,000, ratio 0.0003)
binary compact: 80,032,676 bytes (raw 64,000,000, ratio 1.2505)
Environment
Vortex 0.73.0
Python 3.12
macOS
Additional context
No response