-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-38495][table-runtime] Introduce cache in delta join operator #27099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
<td><h5>table.exec.delta-join.cache-enabled</h5><br> <span class="label label-primary">Streaming</span></td> | ||
<td style="word-wrap: break-word;">true</td> | ||
<td>Boolean</td> | ||
<td>Whether enable the cache of delta join. If enabled, the delta join would cache the records from remote dim table.</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would cache -> caches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be useful to give some guidance in the docs as to when to switch on caching and how to tune the left and right time values. I assume there is some level of staleness we introduce by using the caches, we should talk about this consideration as well. I am interested in what happens to the join results during cache an after the cache invalidates one side after the timeout.
The Jira gives no details. I would like to see details around the motivation behind this change, in which circumstances it is most and least useful so we can easily see how and when this adds value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache in delta joins does not affect the correctness of the final results due to data aging; it only impacts the intermediate results.
In practical usage, the cost of performing remote I/O access for each individual record without caching can be quite substantial. Therefore, I believe that enabling the cache by default, as long as it doesn’t compromise data correctness, may be a better approach. This is also why I mentioned that this parameter is set to true by default in Flip-486.
WDYT?
docs/layouts/shortcodes/generated/execution_config_configuration.html
Outdated
Show resolved
Hide resolved
public static final ConfigOption<Boolean> TABLE_EXEC_DELTA_JOIN_CACHE_ENABLED = | ||
key("table.exec.delta-join.cache-enabled") | ||
.booleanType() | ||
.defaultValue(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your thinking to have a the default as true, I assume this is changing the existing non caching behaviour. It would make more sense to me to have the users opt into the caching and having the default as false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explained earlier.
.booleanType() | ||
.defaultValue(true) | ||
.withDescription( | ||
"Whether enable the cache of delta join. If enabled, the delta join would cache the records from remote dim table."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Whether enable -> Flag to enable
suggest not using would as per previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about updating this comment to Whether to enable the cache of delta join. If enabled, the delta join caches the records from remote dim table. Default is true.
to align with other configs like TABLE_EXEC_SORT_ASYNC_MERGE_ENABLED
, TABLE_EXEC_SPILL_COMPRESSION_ENABLED
, ...?
...src/main/java/org/apache/flink/table/planner/plan/nodes/exec/stream/StreamExecDeltaJoin.java
Outdated
Show resolved
Hide resolved
|
||
class DeltaJoinITCase extends StreamingTestBase { | ||
@ExtendWith(Array(classOf[ParameterizedTestExtension])) | ||
class DeltaJoinITCase(enableCache: Boolean) extends StreamingTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can have store/restore tests for Delta joins like tests extending https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/test/java/org/apache/flink/table/planner/plan/nodes/exec/testutils/RestoreTestBase.java
What is the purpose of the change
Introduce cache in delta join operator
Brief change log
Verifying this change
New tests have been added to verify this pr.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no)Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: yesDocumentation