Elasticsearch index has no vector column after data initialization: how is performance validated? #798

Description

@szhong20

Hello,
I ran a benchmark against Elasticsearch with VectorDBBench. After the data was initialized, I could not find any vector data in the Elasticsearch index. I also checked the downloaded data files, and the vector column exists there. Is there anything I missed during initialization?
Command: "vectordbbench elasticcloudhnsw --case-type Performance768D1M --k 10 --host <elastic_hostname> --port 9200 --user elastic --password --m 16 --ef-construction 200 --search-concurrent --load-concurrency 8 --num-concurrency 1,10,50,100 --scheme http"
Output:
2026-06-15 10:07:22,850 | INFO: Task:
TaskConfig(db=<DB.ElasticCloud: 'ElasticCloud'>, db_config=ElasticCloudConfig(db_label='2026-06-15T10:07:22.760305', version='', note='', cloud_id=None, scheme='http', host='es-cn-nyw4tu5hi0001yfnk.elasticsearch.aliyuncs.com', port=9200, user='elastic', password=SecretStr('**********')), db_case_config=ElasticCloudIndexConfig(element_type=<ESElementType.float: 'float'>, index=<IndexType.ES_HNSW: 'hnsw'>, number_of_shards=1, number_of_replicas=0, refresh_interval='30s', merge_max_thread_count=8, use_rescore=False, oversample_ratio=2.0, use_routing=False, use_force_merge=True, metric_type=None, efConstruction=200, M=16, num_candidates=100), case_config=CaseConfig(case_id=<CaseType.Performance768D1M: 5>, custom_case={}, k=10, concurrency_search_config=ConcurrencySearchConfig(num_concurrency=[1, 10, 50, 100], concurrency_duration=30, concurrency_timeout=3600)), stages=['drop_old', 'load', 'search_serial', 'search_concurrent'], load_concurrency=4)
(cli.py:659) (3145)
2026-06-15 10:07:22,851 | INFO: generated uuid for the tasks: 070569c9f508415f980745148b566b32 (interface.py:73) (3145)
2026-06-15 10:07:23,216 | INFO | DB | CaseType Dataset Filter | task_label (task_runner.py:411)
2026-06-15 10:07:23,217 | INFO | ----------- | ------------ -------------------- ------- | ------- (task_runner.py:411)
2026-06-15 10:07:23,217 | INFO | ElasticCloud-2026-06-15T10:07:22.760305 | Performance Cohere-MEDIUM-1M 0.0 | 070569c9f508415f980745148b566b32 (task_runner.py:411)
2026-06-15 10:07:23,217 | INFO: task submitted: id=070569c9f508415f980745148b566b32, 070569c9f508415f980745148b566b32, case number: 1 (interface.py:248) (3145)
2026-06-15 10:07:23,826 | INFO: [1/1] start case: {'label': <CaseLabel.Performance: 2>, 'name': 'Search Performance Test (1M Dataset, 768 Dim)', 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'ElasticCloud-2026-06-15T10:07:22.760305'}, drop_old=True (interface.py:178) (3180)
2026-06-15 10:07:23,827 | INFO: Starting run (task_runner.py:149) (3180)
2026-06-15 10:07:23,958 | INFO: Elasticsearch client drop_old indices: vdb_bench_indice (elastic_cloud.py:56) (3180)
2026-06-15 10:07:25,823 | INFO: Read the entire file into memory: test.parquet (dataset.py:394) (3180)
2026-06-15 10:07:25,923 | INFO: Read the entire file into memory: neighbors.parquet (dataset.py:394) (3180)
2026-06-15 10:07:25,975 | INFO: Start performance case (task_runner.py:194) (3180)
2026-06-15 10:07:26,839 | INFO: (SpawnProcess-1:1) Start concurrent insert, batch_size=100, max_workers=4 (concurrent_runner.py:187) (3320)
2026-06-15 10:07:26,840 | INFO: Get iterator for shuffle_train.parquet (dataset.py:426) (3320)
2026-06-15 10:19:04,362 | INFO: (SpawnProcess-1:1) Finish concurrent insert, count=1000000, dur=697.52s (concurrent_runner.py:208) (3320)
2026-06-15 10:19:05,418 | INFO: Elasticsearch force merge task id: IzIQWvsDRDyJfYi3ILG6IA:25254 (elastic_cloud.py:216) (3374)
2026-06-15 10:36:05,759 | INFO: Finish loading the entire dataset into VectorDB, insert_duration=698.5211805580002, optimize_duration=1020.3182846210002 load_duration(insert + optimize) = 1718.8395 (task_runner.py:204) (3180)
2026-06-15 10:36:06,372 | INFO: Start search 30s in concurrency 1, filters: type=<FilterOp.NonFilter: 'NonFilter'> filter_rate=0.0 gt_file_name='neighbors.parquet' (mp_runner.py:129) (3180)

There are two columns, "id" and "emb", in shuffle_train.parquet, but I could not find the "emb" column in the index above.

$ parquet-tools csv --head 1 shuffle_train.parquet
id,emb
322406,"[ 1.96000963e-01 -5.27086198e-01 -2.95191228e-01 4.29556400e-01
5.14418483e-01 3.23285192e-01 4.47883815e-01 -2.47427240e-01
2.17925444e-01 2.95179904e-01 -1.87991694e-01 -1.45452484e-01
-7.53609417e-03 2.48572137e-02 -2.38947198e-01 -5.72574914e-01
2.85768330e-01 -2.50302762e-01 -1.09715998e-01 2.03979433e-01
-2.87425637e-01 4.39991504e-01 -4.32560384e-01 -1.68661028e-02
-1.18690394e-01 -1.56994104e-01 -3.84647399e-01 -2.81384345e-02
-7.62783408e-01 3.80847305e-01 9.49241042e-01 -3.09303999e-01
3.34682524e-01 4.52350616e-01 -6.91890001e-01 2.17385769e-01
-1.60764053e-01 9.34349224e-02 -6.08903706e-01 3.95501107e-01
4.59643811e-01 1.34819821e-02 5.26180983e-01 -2.78248936e-01
4.71442789e-01 -4.53977764e-01 4.71780390e-01 -1.68278441e-01
6.41057193e-02 2.62458622e-01 -1.20296814e-01 -4.32358563e-01
-5.24910808e-01 1.35188848e-01 -3.00156236e-01 -9.81063619e-02
...

Image

If there is no vector column in the index, how can vector search performance be validated?
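
One way to narrow this down is to look at the index mapping rather than at the returned documents. In Elasticsearch, a field listed under `_source.excludes` is still indexed and searchable, but it does not appear in the `_source` of documents returned by a search or get. The sketch below is a minimal helper that takes a mapping response as a plain dict and lists any `dense_vector` fields together with whether they are excluded from `_source`. The sample mapping, the field name `vector`, and the idea that the benchmark client excludes the vector from `_source` are assumptions for illustration only, not something confirmed by the logs above; only the index name `vdb_bench_indice` comes from the log output.

```python
from fnmatch import fnmatch
from typing import Any


def find_vector_fields(
    mapping_response: dict[str, Any], index_name: str
) -> dict[str, dict[str, Any]]:
    """List dense_vector fields declared in an index mapping.

    For each field, report its dimension and whether it matches a pattern
    in the mapping's `_source.excludes` list (such fields are indexed but
    are not returned in `_source`).
    """
    mappings = mapping_response.get(index_name, {}).get("mappings", {})
    exclude_patterns = mappings.get("_source", {}).get("excludes", [])

    found: dict[str, dict[str, Any]] = {}
    for name, spec in mappings.get("properties", {}).items():
        if spec.get("type") != "dense_vector":
            continue
        found[name] = {
            "dims": spec.get("dims"),
            # `excludes` entries may contain wildcards, so match as patterns.
            "excluded_from_source": any(
                fnmatch(name, pattern) for pattern in exclude_patterns
            ),
        }
    return found


# Hypothetical mapping, shaped like a get-mapping response.
# The field names and settings here are assumptions, not taken from the issue.
sample_mapping = {
    "vdb_bench_indice": {
        "mappings": {
            "_source": {"excludes": ["vector"]},
            "properties": {
                "id": {"type": "integer", "store": True},
                "vector": {"type": "dense_vector", "dims": 768, "index": True},
            },
        }
    }
}

if __name__ == "__main__":
    print(find_vector_fields(sample_mapping, "vdb_bench_indice"))
```

With the official Python client, the input could come from `client.indices.get_mapping(index="vdb_bench_indice")`. If the helper reports a `dense_vector` field with `excluded_from_source` set to true, the vectors were probably indexed and are simply not shown in the returned documents. If it reports nothing, the vector field is likely missing from the mapping altogether.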
