[WIP] Use itertools.cycle to iterate over random queries from the query file#768

Closed
fressi-elastic wants to merge 7 commits into elastic:master from fressi-elastic:cycle-wikipedia-queries
Conversation

@fressi-elastic
Contributor

This is a follow-up to PR #767.

@fressi-elastic fressi-elastic force-pushed the cycle-wikipedia-queries branch 4 times, most recently from 0014d3d to c02d5f6 Compare March 31, 2025 14:55
@gareth-ellis
Member

I need to have a go at running this, but my initial reaction is that this changes the behaviour of the track. In Rally we have two modes of running a task: time-based or "unit"-based. In the track we would say either "run this for 10 minutes" or "run this 200 times". By not allowing the iterator to renew, if a user says "run this 12000 times", it will actually only run 10000 times, since the query file has 10000 rows.

As I say, I have only looked at this very quickly; I need to run it myself to validate.
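
The concern above can be illustrated with a minimal sketch. The names `rows` and `requested_runs` are hypothetical and not taken from the track; the point is only that a plain iterator stops once the file's rows are consumed, while `itertools.cycle` wraps around to the first row.

```python
import itertools

# Hypothetical stand-in for the 10000 rows of the query file.
rows = [f"query-{i}" for i in range(10_000)]
requested_runs = 12_000  # what a user might ask for in a unit-based task

# A plain iterator is exhausted once every row has been consumed.
plain = iter(rows)
served_plain = sum(1 for _ in itertools.islice(plain, requested_runs))

# itertools.cycle restarts from the first row after the last one.
cycled = itertools.cycle(rows)
served_cycled = [next(cycled) for _ in range(requested_runs)]

print(served_plain)           # 10000
print(len(served_cycled))     # 12000
print(served_cycled[10_000])  # query-0
```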

@fressi-elastic fressi-elastic force-pushed the cycle-wikipedia-queries branch from c02d5f6 to e4a7c55 Compare April 1, 2025 04:07
@fressi-elastic fressi-elastic requested a review from gbanasiak April 1, 2025 04:09
@fressi-elastic fressi-elastic force-pushed the cycle-wikipedia-queries branch 3 times, most recently from b9ca1b5 to ab5db01 Compare April 2, 2025 05:04
Contributor

@AI-IshanBhatt AI-IshanBhatt left a comment

Parentheses in the dataclass decorator

@fressi-elastic fressi-elastic force-pushed the cycle-wikipedia-queries branch from ab5db01 to ec90503 Compare April 2, 2025 12:24
Contributor

@gbanasiak gbanasiak left a comment

This looks good overall. I've asked to document the default random seed.

I'd like to hear the opinion of @elastic/search-relevance. Would you like to pursue this wikipedia change? The benefits I see are:

  • random seed pinning for potentially better reproducibility of results
  • the use of itertools.cycle makes code simpler as there's no need to handle iterator exhaustion
  • unit tests might help us avoid unintended changes in the future, but you might see them as unnecessary boilerplate.

QUERY_RULES_ENDPOINT: str = "/_query_rules"

QUERY_CLEAN_REXEXP = regexp = re.compile("[^0-9a-zA-Z]+")
DEFAULT_SEED = hash(__name__)
Contributor

I like the idea of pinning the seed. We're doing something similar in elastic/logs through a default track parameter value:

* `random_seed` (default: 13) - Files are generated through random sampling of the source corpora. This pseudo random selection process is seeded to ensure multiple runs of the track generate the same data - thus ensuring tests are repeatable. Changing this value or `data_generation_clients` will cause the generation of a different dataset. Must be an integer.
"random-seed": {{ random_seed | default(13) | int }},

Can we document the default in the track's README?
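
As a minimal sketch of what seed pinning gives, assuming a hypothetical helper `sample_queries` (not the PR's `query_samples`): two calls with the same seed draw the same queries. One caveat worth checking for the default: in CPython, `hash()` of a `str` is salted per process unless `PYTHONHASHSEED` is set, so a seed derived from `hash(__name__)` would not be stable across runs. The sketch uses `zlib.crc32` as one stable alternative; that choice and the name passed to it are assumptions, not something the PR does.

```python
import random
import zlib

# Assumption: derive a process-independent default seed from a fixed name.
# zlib.crc32 returns the same value in every interpreter run, unlike hash(str).
STABLE_DEFAULT_SEED = zlib.crc32(b"wikipedia-track")  # hypothetical name


def sample_queries(queries, k, seed=STABLE_DEFAULT_SEED):
    """Draw k queries with replacement using a dedicated, seeded RNG."""
    rng = random.Random(seed)
    return rng.choices(queries, k=k)


queries = [f"query-{i}" for i in range(100)]
first = sample_queries(queries, k=10)
second = sample_queries(queries, k=10)

print(first == second)  # True: same seed, same sample
```

Using a dedicated `random.Random` instance keeps the sampling independent of any other code that touches the module-level RNG.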

self._random_seed = self._params.get("seed", None)
self._sample_queries = query_samples(self._batch_size, self._random_seed)
self._queries_iterator = None
self._queries_it = itertools.cycle(query_samples(k=self._params.get("batch_size", 100000), seed=self._params.get("seed")))
Contributor

Back to @gareth-ellis's point: I think this is correct, i.e. itertools.cycle will cycle back to the beginning of the array returned by query_samples(), which is what the current code is doing. It's convenient, at the cost of doubling memory consumption, because I think itertools.cycle stores a local copy of the array (see https://docs.python.org/3/library/itertools.html#itertools.cycle).

queries.csv has 10k lines, and we're asking for 10x more because the default batch_size/k is 100k. Assuming pure ASCII (1 byte per character) and an average string length of 32 characters, that's an increase in the region of 128k * 32 ~ 4MB. I think that's OK.
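
The back-of-the-envelope estimate above can be reproduced as follows. This only counts raw character payload under the stated assumptions (ASCII, 32 characters per query); it ignores Python's per-object overhead, so treat it as an order-of-magnitude figure. The helper name `payload_bytes` is hypothetical.

```python
# Assumptions taken from the comment above: ASCII text, 32 characters per query.
BYTES_PER_CHAR = 1
AVG_QUERY_LENGTH = 32


def payload_bytes(num_queries):
    """Raw character payload of num_queries average-length queries."""
    return num_queries * AVG_QUERY_LENGTH * BYTES_PER_CHAR


default_k = 100_000     # default batch_size/k
rounded_k = 128 * 1024  # 128k, the rounded-up figure used above

print(payload_bytes(default_k))          # 3200000 bytes, about 3.2 MB
print(payload_bytes(rounded_k) / 2**20)  # 4.0 (MiB)
```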

Member

@kderusso kderusso left a comment

Hey, from a Search Relevance team perspective, randomization like this sounds like a good idea, thanks for adding it along with the tests!

Note that at some point we should think about removing search applications from the challenges here, even if we keep the code in - this product is in maintenance mode.

@gbanasiak gbanasiak changed the title from "Use itertools.cycle to iterate over random queries from the query file." to "[WIP] Use itertools.cycle to iterate over random queries from the query file" Jun 17, 2025
@gbanasiak
Contributor

I'm adding a WIP label to the title to mute alerts. @fressi-elastic, please remove it once you have some free cycles to go back to this.

Co-authored-by: Grzegorz Banasiak <grzegorz.banasiak@elastic.co>
@gareth-ellis gareth-ellis removed their request for review February 23, 2026 09:38

5 participants