[WIP] Use itertools.cycle to iterate over random queries from the query file #768

fressi-elastic wants to merge 7 commits into elastic:master

Conversation
I need to have a go at running this, but my initial reaction is that this changes the behaviour of the track. In Rally we have two modes of running a task: time based, or "unit" based, where the track says run this for 10 minutes, or run this 200 times. By not allowing the iterator to renew, if a user says run this 12,000 times, it will actually only run 10,000 times, since the query file has 10,000 rows. As I say, I have only looked at this very quickly; I need to run it myself to validate.
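The point above is exactly what `itertools.cycle` addresses: a wrapped iterator restarts from the beginning of the sequence, so the requested iteration count is no longer capped by the number of rows in the query file. A minimal sketch (the 3-row list is a stand-in for the 10k-line query file):

```python
import itertools

# A "query file" with only 3 rows, but a task configured for 7 iterations.
# itertools.cycle lets the iteration count exceed the number of rows.
rows = ["q1", "q2", "q3"]  # stand-in for the query file's rows
it = itertools.cycle(rows)
issued = [next(it) for _ in range(7)]
print(issued)  # ['q1', 'q2', 'q3', 'q1', 'q2', 'q3', 'q1']
```

With a plain iterator over `rows`, the eighth `next()` call after exhaustion would raise `StopIteration`; `cycle` never exhausts.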
AI-IshanBhatt left a comment:

Parentheses in the dataclass decorator
This looks good overall. I've asked to document the default random seed.

I'd like to hear @elastic/search-relevance's opinion. Would you like to pursue this wikipedia change? The benefits I see are:

- random seed pinning for potentially better reproducibility of results
- the use of `itertools.cycle` makes the code simpler, as there's no need to handle iterator exhaustion
- unit tests might help us avoid unintended changes in the future, but you might see them as unnecessary boilerplate
```python
QUERY_RULES_ENDPOINT: str = "/_query_rules"

QUERY_CLEAN_REXEXP = regexp = re.compile("[^0-9a-zA-Z]+")
DEFAULT_SEED = hash(__name__)
```
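One caveat worth documenting along with the default: in Python 3, `hash()` of a string is salted per process (`PYTHONHASHSEED`), so `hash(__name__)` is not stable across runs unless that environment variable is pinned. A hedged sketch of the alternative, an explicit integer constant, using a hypothetical `sample_queries` helper (not the track's actual code):

```python
import random

# An explicit integer constant is reproducible across processes,
# unlike hash(__name__), which is salted per process in Python 3.
DEFAULT_SEED = 42  # hypothetical pinned default

def sample_queries(queries, k, seed=DEFAULT_SEED):
    # A local random.Random avoids touching global RNG state.
    rng = random.Random(seed)
    return rng.choices(queries, k=k)

a = sample_queries(["q1", "q2", "q3"], k=5)
b = sample_queries(["q1", "q2", "q3"], k=5)
print(a == b)  # True: same seed, same sample order
```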
I like the idea of pinning the seed. We're doing something similar in elastic/logs through a default track parameter value:

- rally-tracks/elastic/logs/README.md, line 242 in f3e4ad5
- rally-tracks/elastic/logs/track.json, line 155 in f3e4ad5

Can we document the default in the track's README?
```python
self._random_seed = self._params.get("seed", None)
self._sample_queries = query_samples(self._batch_size, self._random_seed)
self._queries_iterator = None
self._queries_it = itertools.cycle(query_samples(k=self._params.get("batch_size", 100000), seed=self._params.get("seed")))
```
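The pattern in the last line of the diff above can be sketched in isolation; `query_samples` here is a simplified stand-in for the track's helper (assumed to return `k` seeded samples from the query file), not its real implementation:

```python
import itertools
import random

def query_samples(k, seed=None):
    # Stand-in for the track's query_samples(): k queries sampled
    # with replacement from the query file, deterministically if seeded.
    queries = ["q1", "q2", "q3"]  # stand-in for queries.csv rows
    return random.Random(seed).choices(queries, k=k)

# Wrap the sampled batch in a cycle so callers can call next()
# indefinitely without any exhaustion handling.
queries_it = itertools.cycle(query_samples(k=5, seed=7))
batch = [next(queries_it) for _ in range(12)]  # more than k, still fine
print(len(batch))  # 12
```

This replaces the three-attribute bookkeeping (`_random_seed`, `_sample_queries`, `_queries_iterator`) with a single iterator attribute.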
Back to @gareth-ellis's point: I think this is correct, i.e. `itertools.cycle` will cycle back to the beginning of the array returned by `query_samples()`, which is what the current code is doing. It's convenient at the cost of roughly doubling memory consumption, because `itertools.cycle` stores a local copy of the array, I think (see https://docs.python.org/3/library/itertools.html#itertools.cycle).

The queries.csv has 10k lines, and we're asking for x10 more, as the default batch_size/k is 100k. Assuming pure ASCII (1 byte per character) and a string length of 32 characters on average, that's an increase in the region of 128k * 32 ~ 4 MB. I think that's OK.
kderusso left a comment:
Hey, from a Search Relevance team perspective, randomization like this sounds like a good idea, thanks for adding it along with the tests!
Note that at some point we should think about removing search applications from the challenges here, even if we keep the code in; this product is in maintenance mode.
I'm adding
Co-authored-by: Grzegorz Banasiak <grzegorz.banasiak@elastic.co>
This follows up on PR #767.