[WIP] Use `itertools.cycle` to iterate over random queries from the query file by fressi-elastic · Pull Request #768 · elastic/rally-tracks

fressi-elastic · 2025-03-31T12:33:22Z

This is a follow-up to PR #767.

gareth-ellis · 2025-03-31T17:56:03Z

I need to have a go at running this, but my initial reaction is that this changes the behaviour of the track. In Rally we have two modes of running a task, either time-based or "unit"-based: in the track we would say either "run this for 10 minutes" or "run this 200 times". If the iterator is not allowed to renew, then a user who asks to run this 12000 times will actually get only 10000 runs, since the query file has 10000 rows.

As I say, I have only looked at this very quickly; I need to run it myself to validate.
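
The concern above can be illustrated at a smaller scale. This is a minimal sketch, not the track's code: `rows` stands in for the query file, and `run_iterations` is a hypothetical helper that mimics a unit-based task asking for a fixed number of iterations.

```python
import itertools


def run_iterations(source, iterations):
    """Pull up to `iterations` items from `source`, stopping early if it runs dry."""
    served = []
    for _ in range(iterations):
        try:
            served.append(next(source))
        except StopIteration:
            break  # a non-renewing iterator ends the task early
    return served


rows = [f"query-{i}" for i in range(10)]  # stand-in for the 10000-row query file

# A plain iterator is exhausted after one pass: 12 requested, only 10 served.
plain = run_iterations(iter(rows), 12)

# itertools.cycle wraps around to the first row, so all 12 are served.
cycled = run_iterations(itertools.cycle(rows), 12)
```

Whether the real track hits the early-stop case depends on how the iterator is renewed, which is what the comment above asks to validate.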

AI-IshanBhatt

Parentheses in the dataclass decorator

wikipedia/tests/test_track.py

gbanasiak

This looks good overall. I've asked to document the default random seed.

I'd like to hear the @elastic/search-relevance opinion. Would you like to pursue this wikipedia change? The benefits I see are:

* random seed pinning, for potentially better reproducibility of results
* the use of itertools.cycle makes the code simpler, as there's no need to handle iterator exhaustion
* unit tests might help us avoid unintended changes in the future, but you might see them as unnecessary boilerplate.
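
The reproducibility benefit can be sketched as follows. `sample_queries` is a hypothetical stand-in for the track's sampling helper (its real implementation is not shown in this thread), and the seed value 13 is only an example.

```python
import random


def sample_queries(queries, k, seed):
    """Hypothetical stand-in for the track's sampling helper."""
    rng = random.Random(seed)  # private generator, global random state untouched
    # Sampling with replacement, so k may exceed len(queries).
    return rng.choices(queries, k=k)


queries = [f"query-{i}" for i in range(10)]

# Two independent calls with the same seed yield the same sequence.
first = sample_queries(queries, k=25, seed=13)
second = sample_queries(queries, k=25, seed=13)
```

A unit test along these lines would fail if a later change altered the sampling order for a given seed.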

gbanasiak · 2025-06-09T15:38:34Z

wikipedia/track.py

 QUERY_RULES_ENDPOINT: str = "/_query_rules"

 QUERY_CLEAN_REXEXP = regexp = re.compile("[^0-9a-zA-Z]+")
+DEFAULT_SEED = hash(__name__)


I like the idea of pinning the seed. We're doing something similar in elastic/logs through a default track parameter value:

rally-tracks/elastic/logs/README.md

Line 242 in f3e4ad5

* `random_seed` (default: 13) - Files are generated through random sampling of the source corpora. This pseudo random selection process is seeded to ensure multiple runs of the track generate the same data - thus ensuring tests are repeatable. Changing this value or `data_generation_clients` will cause the generation of a different dataset. Must be an integer.

rally-tracks/elastic/logs/track.json

Line 155 in f3e4ad5

"random-seed": {{ random_seed | default(13) | int }},

Can we document the default in the track's README?
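
One detail may be worth checking before documenting the default: in CPython, `hash()` of a `str` is salted per process unless `PYTHONHASHSEED` is set, so `hash(__name__)` would generally differ from one run to the next. Below is a minimal sketch of a seed derived deterministically from a module name. `stable_seed` is hypothetical and not part of the PR, and the module name `"wikipedia.track"` is an assumption.

```python
import random
import zlib


def stable_seed(name: str) -> int:
    """Derive a seed that is identical across processes (CRC32 of the UTF-8 bytes)."""
    return zlib.crc32(name.encode("utf-8"))


seed = stable_seed("wikipedia.track")  # module name assumed for illustration

# Two generators built from the same derived seed produce the same first value.
a = random.Random(seed).random()
b = random.Random(stable_seed("wikipedia.track")).random()
```

A fixed literal, such as the 13 used in elastic/logs, would be simpler still and easier to document.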

gbanasiak · 2025-06-09T15:42:56Z

wikipedia/track.py

-        self._random_seed = self._params.get("seed", None)
-        self._sample_queries = query_samples(self._batch_size, self._random_seed)
-        self._queries_iterator = None
+        self._queries_it = itertools.cycle(query_samples(k=self._params.get("batch_size", 100000), seed=self._params.get("seed")))


Back to @gareth-ellis's point: I think this is correct, i.e. itertools.cycle will cycle back to the beginning of the array returned by query_samples(), which is what the current code is doing. It's convenient, at the cost of doubling memory consumption, because I think itertools.cycle stores a local copy of the array (see https://docs.python.org/3/library/itertools.html#itertools.cycle).
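
A small check of what the saved copy contains, assuming CPython's behaviour: the internal sequence kept by `itertools.cycle` appears to hold references to the original elements rather than duplicates of them, so the extra cost would be closer to one pointer per element than to a second copy of every string. The `queries` list here is illustrative.

```python
import itertools

# Build the strings at runtime so identity checks are not affected by interning.
queries = ["".join(["query-", str(i)]) for i in range(3)]
it = itertools.cycle(queries)

first_pass = [next(it) for _ in range(3)]   # items taken from the input
second_pass = [next(it) for _ in range(3)]  # items replayed from cycle's saved copy

# The replayed items are the very same objects, not new strings.
same_objects = all(a is b for a, b in zip(first_pass, second_pass))
```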

The queries.csv has 10k lines, and we're asking for 10x more, as the default batch_size/k is 100k. Assuming pure ASCII (1 byte per character) and an average string length of 32 characters, that's an increase in the region of 128k * 32 ~ 4MB. I think that's OK.
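
The estimate above can be reproduced with a few lines of arithmetic. This sketch counts only the raw character payload and ignores Python's per-object overhead, which is an assumption. `estimate_bytes` is a hypothetical helper.

```python
def estimate_bytes(count: int, avg_chars: int, bytes_per_char: int = 1) -> int:
    """Raw payload size: number of strings x average length x bytes per character."""
    return count * avg_chars * bytes_per_char


batch = estimate_bytes(100_000, 32)       # default batch_size/k from the diff
rounded = estimate_bytes(128 * 1024, 32)  # the 128k figure used in the comment

batch_mb = batch / 1_000_000   # 3.2 (decimal megabytes)
rounded_mib = rounded / 2**20  # 4.0 (binary mebibytes)
```

Either way the figure stays within a few megabytes, which is consistent with the conclusion that the increase is acceptable.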

wikipedia/track.py

kderusso

Hey, from a Search Relevance team perspective, randomization like this sounds like a good idea, thanks for adding it along with the tests!

Note that at some point we should think about removing search applications from the challenges here, even if we keep the code in - this product is in maintenance mode.

gbanasiak · 2025-06-17T04:07:40Z

I'm adding the WIP label to the title to mute alerts. @fressi-elastic, please remove it once you have some free cycles to go back to this.

fressi-elastic force-pushed the cycle-wikipedia-queries branch 4 times, most recently from 0014d3d to c02d5f6 Compare March 31, 2025 14:55

fressi-elastic requested review from a team, carlosdelest, gareth-ellis and kderusso March 31, 2025 14:56

fressi-elastic force-pushed the cycle-wikipedia-queries branch from c02d5f6 to e4a7c55 Compare April 1, 2025 04:07

fressi-elastic requested a review from gbanasiak April 1, 2025 04:09

fressi-elastic force-pushed the cycle-wikipedia-queries branch 3 times, most recently from b9ca1b5 to ab5db01 Compare April 2, 2025 05:04

AI-IshanBhatt reviewed Apr 2, 2025

wikipedia/tests/test_track.py (resolved)

Use itertools.cycle to iterate over random queries from the query file.

ec90503

fressi-elastic force-pushed the cycle-wikipedia-queries branch from ab5db01 to ec90503 Compare April 2, 2025 12:24

fressi-elastic added 5 commits April 2, 2025 15:17

Use default random seed when none is provided to enforce track reproducibility.

6d94f28

Assume default seed is used in all test cases.

7bc4198

Fix test cases: remove seed parameter when preparing the want sequence.

2035b40

Merge remote-tracking branch 'upstream/master' into cycle-wikipedia-queries

0468f04

Fix test case to match the removal of caching.

f7e083b

gbanasiak reviewed Jun 9, 2025

wikipedia/track.py (outdated, resolved)

kderusso approved these changes Jun 9, 2025

gbanasiak changed the title ~~Use itertools.cycle to iterate over random queries from the query file.~~ [WIP] Use itertools.cycle to iterate over random queries from the query file Jun 17, 2025

Update wikipedia/track.py

5a0e92a

Co-authored-by: Grzegorz Banasiak <grzegorz.banasiak@elastic.co>

gareth-ellis removed their request for review February 23, 2026 09:38

fressi-elastic closed this Mar 17, 2026

fressi-elastic wants to merge 7 commits into elastic:master from fressi-elastic:cycle-wikipedia-queries


5 participants
