- The readset of a query, called
read_table_idsin Redset, is the set of distinct tables it scans. - Redset user: Defined by a
user_idand aninstance_id. In Redset,instance_idreferes to the cluster's id. - Query hash of a Redset query: Hash that we use to decide whether two queries are the same. Redset includes a
feature_fingerprintcolumn that is a "proxy for query similarity". They warn that this fingerprint might have false positives. We found cases where this fingerprint can be equal for two different queries with different numbers of joins or scans, or different readsets. To correct such discrepancies, we define our own query hash as the combination offeature_fingerprint,num_scans,num_joins, andreadset. - Query repetition rate is defined for a Redset user as the ratio of his queries that have already occured before (based on the query hash). Notice that the repetition rate can never be 100% since the first occurence of a query hash does not count as a repetition.
We use the IMDb benchmarks CEB and JOB. Both are defined in terms of query templates that can be freely instantiated. We use pre-instantiated versions of CEB and JOB, which we combine together for more distinct templates and query instances to sample from.
| Number of query instances | Number of templates | |
|---|---|---|
| CEB | 13646 | 16 |
| JOB | 113 | 33 |
| Total (CEB+) | 13759 | 49 |
Note
In the remainder of this document, CEB+ will always refer to the combination of CEB and JOB.
Later, while sampling workloads, we will map user queries to CEB+ queries based on their normalized number of joins -> we want to have as many distinct templates and query instances for each possible value of number of joins. Therefore, we only keep queries with number of joins between 6 and 11. The choice of these values is empirical.
| Number of remaining query instances | Number of remaining templates | |
|---|---|---|
| CEB | 12498 / 13646 | 13 / 16 |
| JOB | 79 / 113 | 23 / 33 |
| Total (CEB+) | 12577 / 13759 | 36 / 49 |
Redset is a dataset of customer query metadata, spanning over 3 months and 200 clusters, published by Redshift. Redset comes in two types: provisioned and serverless. We use the provisioned version and prefilter as follows:
- We exclude:
- non-select queries, i.e.,
query_type != 'select'. - queries that scan external tables, i.e.,
num_external_tables_accessed > 0. - queries that scan system tables, i.e.,
num_system_tables_accessed > 0. - queries that do not scan any table, i.e.,
read_table_ids is null. - queries with no joins, i.e.,
num_joins = 0. - queries that were answered from result cache, i.e.,
was_cached = 1. - queries with
num_joins!= |readset| - 1. We want the readset to uniquely identifynum_joins.
- non-select queries, i.e.,
- For each user:
- We find the week where they issued (based on
arrival_timestamp) the most queries. A week starts on Monday 8am and ends on Friday 5pm. - We only keep the queries from that week.
- We find the week where they issued (based on
- We take the remaining queries from the previous step, we keep at most the first 1000 queries of each user, sorted by
arrival_timestamp. - We take the remaining queries from the previous step and exclude:
- users where
max_num_joins - min_num_joins > 11.- The intuition is that if the gap between
min_num_joinsandmax_num_joinsis too big, too many Redset queries with different numbers of joins will be mapped to the same number of joins in CEB+ queries later during workload sampling → risk of overestimating repetition.
- The intuition is that if the gap between
- users where
max_num_joins == min_num_joins.- Later, during the workload sampling, we will need to normalize the number of joins. Equal min and max values prevent us from normalizing (divide by 0).
- users where
→ Remaining Redset queries: 65188 / 441.35M queries.
→ Remaining Redset users: 340 / 192608 users.
Note
In the remainder of this README, Redset will always refer to this prefiltered version.
From each of the 10 query repetition groups 0% - 10%, 10% - 20% up to 90% - 100%, we pick 3 users as follows:
- We rank the users in ascending order:
- once based on the number of distinct
num_joinsvalues across their queries. - and once on the number of distinct
readsetsacross their queries.
- once based on the number of distinct
- We sum up both ranks for each user to get the user's
workload_variabality. A lowerworkload_variabalityvalue means less distinct queries in terms of number of joins and readsets. - We select the users with the lowest, median, and highest
workload_variabalityvalues.
→ this ensures more diverse workloads in Redbench.
Example:
The figure shows both ranks on the x- and y-axis for each user in the 20%-30% query repetition group. Summing up both dimensions for each user to get workload_variability corresponds to computing the manhattan distance. The dotted line represents the points that have a manhattan distance equal to the median workload_variability. The median workload_variability user (orange dot in the figure) that we sample must lie on this line. Every point above the line has a higher variability, and vice versa.
We want to map each sampled user's query timeline to a sequence of CEB+ queries. To do so, we iterate over the timeline and for each query:
user_query_hashalready seen before -> use the CEB+ query instance mapped to it.user_query_hashhash never seen before. We want to build a bijectionuser_query_readset <-> ceb+_template:-
(1) We have already seen
user_query_readsetbefore, i.e.,user_query_readsethas aceb+_templatemapped to it:- (2)
ceb+_templatehas unused query instances -> use one of its unused query instances. - (3)
ceb+_templatehas no unused query instance -> fallback to (6 or 7).
- (2)
-
(4) We haven't seen
user_query_readsetbefore, i.e.,user_query_readsethas noceb+_templatemapped to it:closest_ceb+_queries: All CEB+ queries with the closest number of joins as the user's query, after normalizing both to values between 0 and 1.closest_ceb+_templates: Templates of theclosest_ceb+_queries.- (5) At least one of the
closest_ceb+_templatesis still unmapped -> pick the one with the most number of query instances, insert the mappingceb+_template <-> user_query_readset, and use one of the unused query instances ofceb+_template. - (6) At least one of the already mapped
closest_ceb+_templateshas unused query instances -> pick one randomly and use one of its unused query instances. - (7) None of the
closest_ceb+_templateshas unused query instances -> use one of the query instances inclosest_ceb+_queries
- (5) At least one of the
-
Notice:
- Steps
(2)and(5)are happy paths. - Reaching step
(7)means that CEB+ doesn't have enough query instances to mimic the user's workload. - Follwing the path
(3 -> 6)or(3 -> 7)means that some CEB+ templates don't have enough query instances to mimic the user's workload. - Following the path
(4 -> 6)means that CEB+ doesn't have enough distinct templates to mimic the user's workload. - Statistics on how often we follow each path can be found in
workloads/<group>/stats.csv.
Each subdirectory of workloads/ represents a query repetition group between 0% - 10% to 90% - 100% and contains:
- One CSV file named
stats.csvcontaining stats about the workload sampling step. - For each of the 3 sampled users, one CSV file describing the sampled workload.
Running Redbench boils down to converting the 30 CSV files to SQL files (setup.py), running them, and measuring execution times.
- In steps (5) and (6) of the sampling process, instead of choosing a
ceb+_templaterandomly or based on the number of remaining query instances, we can simulate the sampling process differently to find the optimal choosing strategy to minimze the numbers of time we reach the steps (6) and (7). - To artificially augment the data, we can partition each user's queries timeline into disjunct blocks of 1000 consecutive queries each. Each block counts as an artifical new user, e.g., for a user with 2003 queries, we will end up with 3 artifical users with 1000, 1000, and 3 queries respectively.
- But, this might result in overrepresenting original users that have too many queries.
- The current version of Redbench only considers SELECT queries. For cerain use cases, this might not be enough, e.g., to benchmark caching techniques where updates would invalidate the caches. Redset includes a
write_table_idscolumn that we can make use of.








