Skip to content

Fix TableIdentifier in TaskFileIOSupplier #2304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

XN137
Copy link
Contributor

@XN137 XN137 commented Aug 8, 2025

we cant just convert a TaskEntity to a IcebergTableLikeEntity as the
getTableIdentifier() method will not return a correct value by using
the name of the task and its parent namespace (which is empty?).

task handlers instead need to pass in the TableIdentifier that they
already inferred via TaskEntity.readData.

we cant just convert a `TaskEntity` to a `IcebergTableLikeEntity` as the
`getTableIdentifier()` method will not return a correct value by using
the name of the task and its parent namespace (which is empty?).

task handlers instead need to pass in the `TableIdentifier` that they
already inferred via `TaskEntity.readData`.
@XN137 XN137 force-pushed the Fix-TableIdentifier-in-TaskFileIOSupplier branch from e07dd53 to 4427dc8 Compare August 8, 2025 16:06
@XN137 XN137 marked this pull request as ready for review August 8, 2025 16:24
@@ -53,7 +53,7 @@ public boolean handleTask(TaskEntity task, CallContext callContext) {
BatchFileCleanupTask cleanupTask = task.readData(BatchFileCleanupTask.class);
TableIdentifier tableId = cleanupTask.tableId();
List<String> batchFiles = cleanupTask.batchFiles();
try (FileIO authorizedFileIO = fileIOSupplier.apply(task, callContext)) {
try (FileIO authorizedFileIO = fileIOSupplier.apply(task, tableId, callContext)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the tableId just be a part of the TaskEntity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depends on whether there will ever be tasks that dont operate on a single iceberg table... currently the TaskEntity design seems to leave this open

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about this again, since the TableIdentifier is downstream only getting used for logging, it would seem an disproportionate amount of effort to change the TaskEntity to avoid passing in the tableidentifier here (that is already available in each task handler)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the existence of a task that involves multiple tables mean that a given task can't include a table name in its properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i might not be fully understanding what you had been asking for.
could you please provide a more detailed design of how we would be making the tableId part of the TaskEntity?
and then also why that would be a better solution that simply passing in the tableId that is already available in the task handlers? (which is taken from TaskEntity.readData so its already part of the TaskEntity in a way)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I completely missed that. Anyway, here is a one-line fix that doesn't change any APIs. From my debugger:
Screenshot 2025-08-13 at 11 15 05 AM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This correctly extracts the TableIdentifier (via the TableLikeEntity) from the TASK_DATA like I mentioned above.

Copy link
Contributor Author

@XN137 XN137 Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, here is a one-line fix that doesn't change any APIs

afaict your "one-line fix" only works for table cleanup tasks, where we happen to store a IcebergTableLikeEntity into the TASK_DATA:

Map<String, String> properties = new HashMap<>();
properties.put(
PolarisTaskConstants.TASK_TYPE,
String.valueOf(AsyncTaskType.ENTITY_CLEANUP_SCHEDULER.typeCode()));
properties.put("data", PolarisObjectMapperUtil.serialize(refreshEntityToDrop));

for other tasks, it would be throwing an error as we store something else, for example:

ManifestCleanupTask cleanupTask = task.readData(ManifestCleanupTask.class);
TableIdentifier tableId = cleanupTask.tableId();
try (FileIO authorizedFileIO = fileIOSupplier.apply(task, callContext)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's what I was trying to say above -- if there is a TableIdentifier in the TASK_DATA we should be extracting and using that. The current "cast" of a TaskEntity into a TableLikeEntity is definitely wrong, but if we can fix this the easy way we should.

If there is a task that doesn't have a TableIdentifier in TASK_DATA we should either add it there or remove the need for it (i.e. pass in null to loadFileIO and mark it Nullable) if that's easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is a TableIdentifier in the TASK_DATA we should be extracting and using that.

it seems like we have circled back to a previous point in our discussion, so i am posting my reply again:

So I can just guess that you want to add something like a Optional getTableIdentifier() method to TaskEntity ?
but when we think about how it would need to be implemented, it would mean that it has to contain full knowledge of ALL TYPES of tasks and what kind of objects each one stores in TASK_DATA to de-serialize them and get the TaskIdentifier.

but that same knowledge ALREADY EXISTS in the task handlers (and its where we already have the TableIdentifier alongside the other task parameters), so again, just letting the task handlers pass that value into the TaskFileIOSupplier seems like the right approach to me.

i still dont have an answer as to why changing TaskFileIOSupplier is a no-go when all callers have the TableIdentifier already and the underlying FileIOFactory (in its current form) requires it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants