Hello,
I ran a benchmark with VectorDBBench against Elasticsearch. After the data was loaded, I can't see any vector data in the Elasticsearch index. I also checked the downloaded dataset files, and the vector column exists there. Is there anything I missed during initialization?
Command: "vectordbbench elasticcloudhnsw --case-type Performance768D1M --k 10 --host <elastic_hostname> --port 9200 --user elastic --password --m 16 --ef-construction 200 --search-concurrent --load-concurrency 8 --num-concurrency 1,10,50,100 --scheme http"
Output:
2026-06-15 10:07:22,850 | INFO: Task:
TaskConfig(db=<DB.ElasticCloud: 'ElasticCloud'>, db_config=ElasticCloudConfig(db_label='2026-06-15T10:07:22.760305', version='', note='', cloud_id=None, scheme='http', host='es-cn-nyw4tu5hi0001yfnk.elasticsearch.aliyuncs.com', port=9200, user='elastic', password=SecretStr('**********')), db_case_config=ElasticCloudIndexConfig(element_type=<ESElementType.float: 'float'>, index=<IndexType.ES_HNSW: 'hnsw'>, number_of_shards=1, number_of_replicas=0, refresh_interval='30s', merge_max_thread_count=8, use_rescore=False, oversample_ratio=2.0, use_routing=False, use_force_merge=True, metric_type=None, efConstruction=200, M=16, num_candidates=100), case_config=CaseConfig(case_id=<CaseType.Performance768D1M: 5>, custom_case={}, k=10, concurrency_search_config=ConcurrencySearchConfig(num_concurrency=[1, 10, 50, 100], concurrency_duration=30, concurrency_timeout=3600)), stages=['drop_old', 'load', 'search_serial', 'search_concurrent'], load_concurrency=4)
(cli.py:659) (3145)
2026-06-15 10:07:22,851 | INFO: generated uuid for the tasks: 070569c9f508415f980745148b566b32 (interface.py:73) (3145)
2026-06-15 10:07:23,216 | INFO | DB | CaseType Dataset Filter | task_label (task_runner.py:411)
2026-06-15 10:07:23,217 | INFO | ----------- | ------------ -------------------- ------- | ------- (task_runner.py:411)
2026-06-15 10:07:23,217 | INFO | ElasticCloud-2026-06-15T10:07:22.760305 | Performance Cohere-MEDIUM-1M 0.0 | 070569c9f508415f980745148b566b32 (task_runner.py:411)
2026-06-15 10:07:23,217 | INFO: task submitted: id=070569c9f508415f980745148b566b32, 070569c9f508415f980745148b566b32, case number: 1 (interface.py:248) (3145)
2026-06-15 10:07:23,826 | INFO: [1/1] start case: {'label': <CaseLabel.Performance: 2>, 'name': 'Search Performance Test (1M Dataset, 768 Dim)', 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'ElasticCloud-2026-06-15T10:07:22.760305'}, drop_old=True (interface.py:178) (3180)
2026-06-15 10:07:23,827 | INFO: Starting run (task_runner.py:149) (3180)
2026-06-15 10:07:23,958 | INFO: Elasticsearch client drop_old indices: vdb_bench_indice (elastic_cloud.py:56) (3180)
2026-06-15 10:07:25,823 | INFO: Read the entire file into memory: test.parquet (dataset.py:394) (3180)
2026-06-15 10:07:25,923 | INFO: Read the entire file into memory: neighbors.parquet (dataset.py:394) (3180)
2026-06-15 10:07:25,975 | INFO: Start performance case (task_runner.py:194) (3180)
2026-06-15 10:07:26,839 | INFO: (SpawnProcess-1:1) Start concurrent insert, batch_size=100, max_workers=4 (concurrent_runner.py:187) (3320)
2026-06-15 10:07:26,840 | INFO: Get iterator for shuffle_train.parquet (dataset.py:426) (3320)
2026-06-15 10:19:04,362 | INFO: (SpawnProcess-1:1) Finish concurrent insert, count=1000000, dur=697.52s (concurrent_runner.py:208) (3320)
2026-06-15 10:19:05,418 | INFO: Elasticsearch force merge task id: IzIQWvsDRDyJfYi3ILG6IA:25254 (elastic_cloud.py:216) (3374)
2026-06-15 10:36:05,759 | INFO: Finish loading the entire dataset into VectorDB, insert_duration=698.5211805580002, optimize_duration=1020.3182846210002 load_duration(insert + optimize) = 1718.8395 (task_runner.py:204) (3180)
2026-06-15 10:36:06,372 | INFO: Start search 30s in concurrency 1, filters: type=<FilterOp.NonFilter: 'NonFilter'> filter_rate=0.0 gt_file_name='neighbors.parquet' (mp_runner.py:129) (3180)
There are two columns, "id" and "emb", in shuffle_train.parquet, but I can't find the "emb" column in the index above.
$ parquet-tools csv --head 1 shuffle_train.parquet
id,emb
322406,"[ 1.96000963e-01 -5.27086198e-01 -2.95191228e-01 4.29556400e-01
5.14418483e-01 3.23285192e-01 4.47883815e-01 -2.47427240e-01
2.17925444e-01 2.95179904e-01 -1.87991694e-01 -1.45452484e-01
-7.53609417e-03 2.48572137e-02 -2.38947198e-01 -5.72574914e-01
2.85768330e-01 -2.50302762e-01 -1.09715998e-01 2.03979433e-01
-2.87425637e-01 4.39991504e-01 -4.32560384e-01 -1.68661028e-02
-1.18690394e-01 -1.56994104e-01 -3.84647399e-01 -2.81384345e-02
-7.62783408e-01 3.80847305e-01 9.49241042e-01 -3.09303999e-01
3.34682524e-01 4.52350616e-01 -6.91890001e-01 2.17385769e-01
-1.60764053e-01 9.34349224e-02 -6.08903706e-01 3.95501107e-01
4.59643811e-01 1.34819821e-02 5.26180983e-01 -2.78248936e-01
4.71442789e-01 -4.53977764e-01 4.71780390e-01 -1.68278441e-01
6.41057193e-02 2.62458622e-01 -1.20296814e-01 -4.32358563e-01
-5.24910808e-01 1.35188848e-01 -3.00156236e-01 -9.81063619e-02
...
If there is no vector column in the index, how can the vector search performance be validated?
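For reference, one way to double-check this would be to look at the index mapping rather than at the documents returned by a search, since a field can be indexed without showing up in the returned documents. Below is a minimal sketch of that check, not something taken from VectorDBBench itself. It assumes the mapping has already been fetched (for example with `GET /vdb_bench_indice/_mapping`) and parsed into a Python dict. The `sample_mapping` contents, including the field name `vector`, are hypothetical placeholders and not output from my cluster.

```python
def find_dense_vector_fields(properties, prefix=""):
    """Return dotted paths of all fields whose mapping type is dense_vector."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "dense_vector":
            found.append(path)
        # Object fields keep their sub-fields under a nested "properties" key.
        found.extend(
            find_dense_vector_fields(spec.get("properties", {}), prefix=f"{path}.")
        )
    return found


# Hypothetical mapping response, shaped like the JSON returned by the
# get-mapping API. The field names here are assumptions for illustration.
sample_mapping = {
    "vdb_bench_indice": {
        "mappings": {
            "properties": {
                "id": {"type": "integer"},
                "vector": {"type": "dense_vector", "dims": 768},
            }
        }
    }
}

props = sample_mapping["vdb_bench_indice"]["mappings"]["properties"]
print(find_dense_vector_fields(props))
```

If the list comes back empty for the real mapping, the index has no vector field at all. If it is non-empty, the vectors are mapped under a different name than the parquet column "emb", and the open question is only why they do not appear in the documents I looked at.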