-
Notifications
You must be signed in to change notification settings - Fork 290
Fix TableIdentifier in TaskFileIOSupplier #2304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
we cant just convert a `TaskEntity` to a `IcebergTableLikeEntity` as the `getTableIdentifier()` method will not return a correct value by using the name of the task and its parent namespace (which is empty?). task handlers instead need to pass in the `TableIdentifier` that they already inferred via `TaskEntity.readData`.
e07dd53
to
4427dc8
Compare
@@ -53,7 +53,7 @@ public boolean handleTask(TaskEntity task, CallContext callContext) { | |||
BatchFileCleanupTask cleanupTask = task.readData(BatchFileCleanupTask.class); | |||
TableIdentifier tableId = cleanupTask.tableId(); | |||
List<String> batchFiles = cleanupTask.batchFiles(); | |||
try (FileIO authorizedFileIO = fileIOSupplier.apply(task, callContext)) { | |||
try (FileIO authorizedFileIO = fileIOSupplier.apply(task, tableId, callContext)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the tableId just be a part of the TaskEntity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
depends on whether there will ever be tasks that dont operate on a single iceberg table... currently the TaskEntity
design seems to leave this open
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about this again, since the TableIdentifier
is downstream only getting used for logging, it would seem an disproportionate amount of effort to change the TaskEntity
to avoid passing in the tableidentifier here (that is already available in each task handler)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would the existence of a task that involves multiple tables mean that a given task can't include a table name in its properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry i might not be fully understanding what you had been asking for.
could you please provide a more detailed design of how we would be making the tableId part of the TaskEntity?
and then also why that would be a better solution that simply passing in the tableId that is already available in the task handlers? (which is taken from TaskEntity.readData
so its already part of the TaskEntity
in a way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, I completely missed that. Anyway, here is a one-line fix that doesn't change any APIs. From my debugger:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This correctly extracts the TableIdentifier (via the TableLikeEntity) from the TASK_DATA like I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, here is a one-line fix that doesn't change any APIs
afaict your "one-line fix" only works for table cleanup tasks, where we happen to store a IcebergTableLikeEntity
into the TASK_DATA
:
Lines 1433 to 1437 in 1277eff
Map<String, String> properties = new HashMap<>(); | |
properties.put( | |
PolarisTaskConstants.TASK_TYPE, | |
String.valueOf(AsyncTaskType.ENTITY_CLEANUP_SCHEDULER.typeCode())); | |
properties.put("data", PolarisObjectMapperUtil.serialize(refreshEntityToDrop)); |
for other tasks, it would be throwing an error as we store something else, for example:
Lines 64 to 66 in 20febda
ManifestCleanupTask cleanupTask = task.readData(ManifestCleanupTask.class); | |
TableIdentifier tableId = cleanupTask.tableId(); | |
try (FileIO authorizedFileIO = fileIOSupplier.apply(task, callContext)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what I was trying to say above -- if there is a TableIdentifier in the TASK_DATA we should be extracting and using that. The current "cast" of a TaskEntity into a TableLikeEntity is definitely wrong, but if we can fix this the easy way we should.
If there is a task that doesn't have a TableIdentifier in TASK_DATA we should either add it there or remove the need for it (i.e. pass in null
to loadFileIO
and mark it Nullable
) if that's easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is a TableIdentifier in the TASK_DATA we should be extracting and using that.
it seems like we have circled back to a previous point in our discussion, so i am posting my reply again:
So I can just guess that you want to add something like a Optional getTableIdentifier() method to TaskEntity ?
but when we think about how it would need to be implemented, it would mean that it has to contain full knowledge of ALL TYPES of tasks and what kind of objects each one stores in TASK_DATA to de-serialize them and get the TaskIdentifier.but that same knowledge ALREADY EXISTS in the task handlers (and its where we already have the TableIdentifier alongside the other task parameters), so again, just letting the task handlers pass that value into the TaskFileIOSupplier seems like the right approach to me.
i still dont have an answer as to why changing TaskFileIOSupplier
is a no-go when all callers have the TableIdentifier
already and the underlying FileIOFactory
(in its current form) requires it.
we cant just convert a
TaskEntity
to aIcebergTableLikeEntity
as thegetTableIdentifier()
method will not return a correct value by usingthe name of the task and its parent namespace (which is empty?).
task handlers instead need to pass in the
TableIdentifier
that theyalready inferred via
TaskEntity.readData
.