49 changes: 41 additions & 8 deletions random_vector/README.md
@@ -1,13 +1,27 @@
# Random Vector Track

This track is intended for benchmarking filtered vector search using randomly generated vectors.
This track is intended for benchmarking filtered vector search using randomly generated vectors in a **multi-partition** setup.
By default, it uses the `bbq_flat` `vector_index_type` to evaluate the performance of brute-force search with partition ID-based filtering.

The `paragraph_size` parameter determines how many random vectors are indexed per document.

* If `paragraph_size` is set to `1` (the default), each document contains a single top-level random vector.
* If `paragraph_size` is greater than `1`, that number of random vectors is indexed as nested fields within each document.
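
As an illustration, the two document shapes could look like this (a simplified sketch; `dims` is reduced to 4 for readability, and the timestamp and partition ID values are made up):

```python
import numpy as np

dims = 4  # the track default is 128; reduced here for readability

# paragraph_size == 1: a single top-level random vector per document
flat_doc = {
    "@timestamp": 1700000000000,
    "partition_id": "17",
    "emb": np.random.rand(dims).tolist(),
}

# paragraph_size == 3: that many vectors indexed as nested fields
nested_doc = {
    "@timestamp": 1700000000000,
    "partition_id": "17",
    "nested": [
        {"emb": np.random.rand(dims).tolist(), "paragraph_id": i}
        for i in range(3)
    ],
}
```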

## Multi-Partition Model

Partitions are organized into three tiers with configurable counts:

* **Small partitions** (`small_partitions`, default: 100): 1,000–10,000 documents each
* **Medium partitions** (`medium_partitions`, default: 20): 10,000–100,000 documents each
* **Large partitions** (`large_partitions`, default: 5): 100,000–1,000,000 documents each

The distribution follows a realistic pattern: many small partitions, fewer medium, and fewest large.

Each partition's exact document count is determined by a seeded RNG (`partition_seed`, default: 42), ensuring reproducible runs. During indexing, documents are assigned to partitions via weighted random sampling proportional to each partition's target size.

The index is sorted by `partition_id` and documents are routed by `partition_id`, keeping each partition's data co-located.
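
The registry construction and weighted sampling can be sketched in Python as follows (a simplified version of the logic the track uses; tier counts are reduced here for brevity):

```python
import random
from bisect import bisect_left

# Document-count ranges per tier
TIER_RANGES = {
    "small": (1_000, 10_000),
    "medium": (10_000, 100_000),
    "large": (100_000, 1_000_000),
}

def build_registry(counts, seed=42):
    """Deterministically assign a target size to each partition."""
    rng = random.Random(seed)
    partitions, cumulative = [], []
    total = 0
    pid = 0
    for tier in ("small", "medium", "large"):
        lo, hi = TIER_RANGES[tier]
        for _ in range(counts[tier]):
            size = rng.randint(lo, hi)
            partitions.append((str(pid), size, tier))
            total += size
            cumulative.append(total)
            pid += 1
    return partitions, cumulative

def pick_partition(partitions, cumulative):
    """Weighted sampling: larger partitions receive proportionally more docs."""
    i = bisect_left(cumulative, random.randint(1, cumulative[-1]))
    return partitions[i]

parts, cum = build_registry({"small": 3, "medium": 2, "large": 1})
# Same seed => same sizes on every run, so benchmark runs are reproducible.
assert build_registry({"small": 3, "medium": 2, "large": 1})[0] == parts
```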

## Indexing

Indexing runs in one of two modes, depending on whether `index_target_throughput` is specified.
@@ -25,15 +39,31 @@ The total number of documents indexed is:
Each document indexed includes:

* A random vector with `dims` dimensions.
* A randomly assigned partition ID.
* A partition ID assigned via weighted random selection.

The index is sorted by partition ID.
The index is sorted by partition ID and documents are routed by partition ID.
> **Review comment (Member):** optionally routed

This ensures that vectors from the same partition are stored close together, improving the efficiency of filtered searches.

## Search Operations

Search operations involve filtering by a random partition ID and scoring against a random query vector.
These operations are executed against the index using various DSL flavors, including script score and the knn section.
Search tasks are broken up by partition tier to separately measure QPS and latency for small, medium, and large partitions:

* `small-partition-search`: Queries only small-tier partitions
* `medium-partition-search`: Queries only medium-tier partitions
* `large-partition-search`: Queries only large-tier partitions

Each search phase filters by a randomly chosen partition ID within the tier and scores against a random query vector.
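
For reference, the query body for such a filtered search looks roughly like this (a sketch mirroring the track's query generation; the query vector is shortened):

```python
def filtered_knn_query(field, query_vector, partition_id, k, rescore_oversample=0):
    # Brute-force (flat) indices score every vector that passes the filter,
    # so the partition_id term filter bounds the work done per query.
    knn = {
        "field": field,
        "query_vector": query_vector,
        "k": k,
        "num_candidates": k,
        "filter": {"term": {"partition_id": partition_id}},
    }
    # Only request rescoring when oversampling is actually enabled.
    if rescore_oversample > 0:
        knn["rescore_vector"] = {"oversample": rescore_oversample}
    return {"knn": knn}

query = filtered_knn_query("emb", [0.1, 0.2, 0.3], partition_id="42", k=10)
```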

## Nightly Benchmarking

For nightly runs, use the following recommended parameters:

```
--track-params="dims:1024,vector_index_type:bbq_flat"
--track-params="dims:1024,vector_index_type:bbq_disk"
```

Run both `bbq_flat` and `bbq_disk` to capture performance on both index types.

### Parameters

@@ -47,10 +77,13 @@ This track accepts the following parameters with Rally 0.8.0+ using `--track-params`:
- index_clients (default: 1)
- index_iterations (default: 1000)
- index_bulk_size (default: 1000)
- search_iterations (default: 1000)
- search_iterations (default: 10000)
- search_clients (default: 8)
- dims (default: 128)
- partitions (default: 1000)
- dims (default: 128): Number of dimensions for the random vectors. Use 1024 for nightly runs.
- small_partitions (default: 100): Number of small partitions (1k–10k docs each).
- medium_partitions (default: 20): Number of medium partitions (10k–100k docs each).
- large_partitions (default: 5): Number of large partitions (100k–1M docs each).
- partition_seed (default: 42): Seed for deterministic partition size assignment.
- rescore_oversample (default: 0)
- vector_index_element_type (default: "float"): Sets the dense_vector element type.
- enable_experimental_features (default: false): Enables experimental dense vector features that may break backward compatibility.
27 changes: 20 additions & 7 deletions random_vector/challenges/default.json
@@ -1,6 +1,6 @@
{
"name": "index-and-search",
"description": "Create an index and index doc with random content into it.",
"description": "Index documents with variable-sized partitions and benchmark filtered knn search per partition tier.",
"default": true,
"schedule": [
{
@@ -44,14 +44,27 @@
"request-timeout": 1000,
"include-in-reporting": true
}
},
}{%- if small_partitions | default(100) | int > 0 %},
{
"name": "small-partition-search",
"operation": "brute-force-filtered-search-small-partition",
"warmup-iterations": 500,
"iterations": {{ search_iterations | default(10000) | int }},
"clients": {{ search_clients | default(8) | int }}
}{%- endif %}{%- if medium_partitions | default(20) | int > 0 %},
{
"name": "medium-partition-search",
"operation": "brute-force-filtered-search-medium-partition",
"warmup-iterations": 500,
"iterations": {{ search_iterations | default(10000) | int }},
"clients": {{ search_clients | default(8) | int }}
}{%- endif %}{%- if large_partitions | default(5) | int > 0 %},
{
"name": "brute-force-filtered-search",
"operation": "brute-force-filtered-search",
"script": false,
"warmup-iterations": 1000,
"name": "large-partition-search",
"operation": "brute-force-filtered-search-large-partition",
"warmup-iterations": 500,
"iterations": {{ search_iterations | default(10000) | int }},
"clients": {{ search_clients | default(8) | int }}
}
}{%- endif %}
]
}
7 changes: 6 additions & 1 deletion random_vector/index-template.json
@@ -1,7 +1,9 @@
{
"index_patterns": ["vectors-benchmark-*"],
"priority": 500,
"data_stream": {},
"data_stream": {
"allow_custom_routing": true
},
"template": {
"settings": {
{# non-serverless-index-settings-marker-start #}{%- if build_flavor != "serverless" or serverless_operator == true -%}
@@ -14,6 +16,9 @@
}
},
"mappings": {
"_routing": {
"required": true
},
> **Review comment on lines +19 to +21 (Member):** Nope, cannot do this. This needs to be completely optional for now. Routing is not allowed in serverless.
>
> **Review comment (Member):** We should maybe have a routing template and a non-routing template, and the Rally runner can pick the correct one given configuration.

"properties": {
"@timestamp": {
"type": "date"
61 changes: 59 additions & 2 deletions random_vector/operations/default.json
@@ -12,7 +12,10 @@
"operation-type": "bulk",
"param-source": "random-bulk-param-source",
"dims": {{ dims | default(128) | int }},
"partitions": {{ partitions | default(1000) | int }},
"small-partitions": {{ small_partitions | default(100) | int }},
"medium-partitions": {{ medium_partitions | default(20) | int }},
"large-partitions": {{ large_partitions | default(5) | int }},
"partition-seed": {{ partition_seed | default(42) | int }},
"bulk-size": {{ index_bulk_size | default(1000) | int}},
"paragraph-size": {{paragraph_size | default(1) | int}}
},
@@ -21,7 +24,61 @@
"operation-type": "search",
"param-source": "knn-param-source",
"dims": {{ dims | default(128) | int }},
"partitions": {{ partitions | default(1000) | int }},
"small-partitions": {{ small_partitions | default(100) | int }},
"medium-partitions": {{ medium_partitions | default(20) | int }},
"large-partitions": {{ large_partitions | default(5) | int }},
"partition-seed": {{ partition_seed | default(42) | int }},
"rescore-oversample": {{ rescore_oversample | default(0) | int }},
{%- if paragraph_size | default(1) | int > 1 -%}
"field": "nested.emb"
{%- else %}
"field": "emb"
{%- endif %}
},
{
"name": "brute-force-filtered-search-small-partition",
"operation-type": "search",
"param-source": "knn-param-source",
"dims": {{ dims | default(128) | int }},
"small-partitions": {{ small_partitions | default(100) | int }},
"medium-partitions": {{ medium_partitions | default(20) | int }},
"large-partitions": {{ large_partitions | default(5) | int }},
"partition-seed": {{ partition_seed | default(42) | int }},
"partition-tier": "small",
"rescore-oversample": {{ rescore_oversample | default(0) | int }},
{%- if paragraph_size | default(1) | int > 1 -%}
"field": "nested.emb"
{%- else %}
"field": "emb"
{%- endif %}
},
{
"name": "brute-force-filtered-search-medium-partition",
"operation-type": "search",
"param-source": "knn-param-source",
"dims": {{ dims | default(128) | int }},
"small-partitions": {{ small_partitions | default(100) | int }},
"medium-partitions": {{ medium_partitions | default(20) | int }},
"large-partitions": {{ large_partitions | default(5) | int }},
"partition-seed": {{ partition_seed | default(42) | int }},
"partition-tier": "medium",
"rescore-oversample": {{ rescore_oversample | default(0) | int }},
{%- if paragraph_size | default(1) | int > 1 -%}
"field": "nested.emb"
{%- else %}
"field": "emb"
{%- endif %}
},
{
"name": "brute-force-filtered-search-large-partition",
"operation-type": "search",
"param-source": "knn-param-source",
"dims": {{ dims | default(128) | int }},
"small-partitions": {{ small_partitions | default(100) | int }},
"medium-partitions": {{ medium_partitions | default(20) | int }},
"large-partitions": {{ large_partitions | default(5) | int }},
> **Review comment on lines +77 to +79 (Member):** Why do we need this passed in on every task? It seems like we could just name the partitions small-N, medium-N, large-N.
"partition-seed": {{ partition_seed | default(42) | int }},
"partition-tier": "large",
"rescore-oversample": {{ rescore_oversample | default(0) | int }},
{%- if paragraph_size | default(1) | int > 1 -%}
"field": "nested.emb"
107 changes: 92 additions & 15 deletions random_vector/track.py
@@ -1,31 +1,94 @@
import random
import time
from bisect import bisect_left

from esrally.track.params import ParamSource

TIER_SMALL = "small"
TIER_MEDIUM = "medium"
TIER_LARGE = "large"
TIERS = (TIER_SMALL, TIER_MEDIUM, TIER_LARGE)

# Size ranges per tier (document counts)
TIER_RANGES = {
TIER_SMALL: (1000, 10000),
TIER_MEDIUM: (10000, 100000),
TIER_LARGE: (100000, 1000000),
}


def build_partition_registry(small_partitions, medium_partitions, large_partitions, partition_seed):
"""
Build a deterministic partition registry from the given counts and seed.
Returns a list of (partition_id, target_size, tier) tuples and a list of
cumulative weights for weighted random selection during indexing.
"""
rng = random.Random(partition_seed)
partitions = []
cumulative_weights = []
cumulative_weight = 0
partition_id = 0
for tier, count in [(TIER_SMALL, small_partitions), (TIER_MEDIUM, medium_partitions), (TIER_LARGE, large_partitions)]:
lo, hi = TIER_RANGES[tier]
for _ in range(count):
target_size = rng.randint(lo, hi)
partitions.append((str(partition_id), target_size, tier))
cumulative_weight += target_size
cumulative_weights.append(cumulative_weight)
partition_id += 1

if not partitions:
raise ValueError("At least one partition must be configured")

return partitions, cumulative_weights


def extract_partition_config(params):
small = params.get("small-partitions", 100)
medium = params.get("medium-partitions", 20)
large = params.get("large-partitions", 5)
seed = params.get("partition-seed", 42)

for name, value in (("small-partitions", small), ("medium-partitions", medium), ("large-partitions", large)):
if value < 0:
raise ValueError(f"{name} must be non-negative")

if small + medium + large == 0:
raise ValueError("At least one partition must be configured")

return small, medium, large, seed


def pick_partition(partitions, cumulative_weights):
"""Select a partition using weighted random sampling (proportional to target size)."""
partition_index = bisect_left(cumulative_weights, random.randint(1, cumulative_weights[-1]))
return partitions[partition_index]


class RandomBulkParamSource(ParamSource):
def __init__(self, track, params, **kwargs):
super().__init__(track, params, **kwargs)
self._bulk_size = params.get("bulk-size", 1000)
self._index_name = track.data_streams[0].name
self._dims = params.get("dims", 128)
self._partitions = params.get("partitions", 1000)
self._paragraph_size = params.get("paragraph-size", 1)

small, medium, large, seed = extract_partition_config(params)
self._partitions, self._cumulative = build_partition_registry(small, medium, large, seed)

def params(self):
import numpy as np

timestamp = int(time.time()) * 1000
bulk_data = []
for _ in range(self._bulk_size):
partition_id = random.randint(0, self._partitions)
metadata = {"_index": self._index_name}
partition_id, _, _ = pick_partition(self._partitions, self._cumulative)
metadata = {"_index": self._index_name, "routing": partition_id}
bulk_data.append({"create": metadata})
doc = {"@timestamp": timestamp, "partition_id": partition_id}
if self._paragraph_size > 1:
nested_vec = []
for i in range(0, self._paragraph_size):
for i in range(self._paragraph_size):
nested_vec.append({"emb": np.random.rand(self._dims).tolist(), "paragraph_id": i})
doc["nested"] = nested_vec
else:
@@ -43,27 +106,40 @@ def params(self):


def generate_knn_query(field_name, query_vector, partition_id, k, rescore_oversample):
return {
"knn": {
"field": field_name,
"query_vector": query_vector,
"k": k,
"num_candidates": k,
"filter": {"term": {"partition_id": partition_id}},
"rescore_vector": {"oversample": rescore_oversample},
},
knn_query = {
"field": field_name,
"query_vector": query_vector,
"k": k,
"num_candidates": k,
"filter": {"term": {"partition_id": partition_id}},
}

if rescore_oversample > 0:
knn_query["rescore_vector"] = {"oversample": rescore_oversample}
> **Review comment on lines +117 to +118 (Member):** A rescore value of 0 indicates NO rescore. -1 should indicate that the index default is used.
return {"knn": knn_query}


class RandomSearchParamSource:
def __init__(self, track, params, **kwargs):
self._index_name = track.data_streams[0].name
self._cache = params.get("cache", False)
self._field = params.get("field", "emb")
self._partitions = params.get("partitions", 1000)
self._dims = params.get("dims", 128)
self._top_k = params.get("k", 10)
self._rescore_oversample = params.get("rescore-oversample", 0)
> **Review comment (Member):** Default should be -1, indicating that the index default is used. 0 should indicate no rescore.

small, medium, large, seed = extract_partition_config(params)
self._partitions, _ = build_partition_registry(small, medium, large, seed)

partition_tier = params.get("partition-tier", None)
if partition_tier is not None:
if partition_tier not in TIERS:
raise ValueError(f"partition-tier must be one of: {', '.join(TIERS)}")
self._tier_partitions = [p for p in self._partitions if p[2] == partition_tier]
else:
self._tier_partitions = self._partitions

self.infinite = True

def partition(self, partition_index, total_partitions):
Expand All @@ -72,7 +148,8 @@ def partition(self, partition_index, total_partitions):
def params(self):
import numpy as np

partition_id = random.randint(0, self._partitions)
partition = random.choice(self._tier_partitions)
partition_id = partition[0]
query_vec = np.random.rand(self._dims).tolist()
query = generate_knn_query(self._field, query_vec, partition_id, self._top_k, self._rescore_oversample)
return {"index": self._index_name, "cache": self._cache, "size": self._top_k, "body": query}