Skip to content

Remove overloads of BasePersistence.listEntities #2262

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

XN137
Copy link
Contributor

@XN137 XN137 commented Aug 5, 2025

BasePersistence.listEntities and TransactionalPersistence.listEntitiesInCurrentTxn
are afaict only overloaded to make the entityFilter and transformer
parameters "optional" for some callers.
the most central method still has to have support for both of them.

using default methods in the interface would enable this more clearly
but as it turns out, hardly any callers are utilizing those overloads
and thus if we simply remove the overloads we have less code and things
become clearer across the board.

we also add a EntityNameLookupRecord.fromEntity factory method to
replace the overloaded constructor.

@github-project-automation github-project-automation bot moved this to PRs In Progress in Basic Kanban Board Aug 5, 2025
@XN137 XN137 force-pushed the Simplify-listEntities branch from 5944cb1 to 314b3dd Compare August 5, 2025 14:23
@XN137 XN137 changed the title Simplify listEntities Remove overloads of BasePersistence.listEntities Aug 5, 2025
`BasePersistence.listEntities` and `TransactionalPersistence.listEntitiesInCurrentTxn`
are afaict only overloaded to make the `entityFilter` and `transformer`
parameters "optional" for some callers.
the most central method still has to have support for both of them.

using default methods in the interface would enable this more clearly
but as it turns out, hardly any callers are utilizing those overloads
and thus if we simply remove the overloads we have less code and things
become clearer across the board.

we also add a `EntityNameLookupRecord.fromEntity` factory method to
replace the overloaded constructor.
@XN137 XN137 force-pushed the Simplify-listEntities branch from 314b3dd to b3a06bb Compare August 5, 2025 14:26
@XN137 XN137 marked this pull request as ready for review August 5, 2025 14:27
Copy link
Member

@snazy snazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks straightforward
+1

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Aug 5, 2025
* @return the list of entities for the specified list operation
*/
@Nonnull
Page<EntityNameLookupRecord> listEntities(
Copy link
Contributor

@flyrain flyrain Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest a dev ML discussion or vote for interface change in this class. cc @dennishuo cc @singhpk234

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular type is actually not even used by any production code outside of a particular persistence implementation.

If a particular implementation needs more functions, those implementations are free to keep it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you help me understand what particular concern one might have?

the innermost methods that support all the parameters are still there (so not that hard to switch to if backward-compatibility is the concern)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because BasePersistence is a core persistence interface, any change—no matter how small—deserves a note on the dev mailing list so everyone stays in the loop.

this.typeCode = entity.getTypeCode();
this.name = entity.getName();
this.subTypeCode = entity.getSubTypeCode();
public static EntityNameLookupRecord fromEntity(PolarisBaseEntity entity) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the motivation for this change here? if no particular reason, can we keep the original constructor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine to add a new static method like this. I'd suggest to deprecate the constructor first and remove it later if we don't need it.

Copy link
Contributor

@dennishuo dennishuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss on mailing list. At least one big service provider with a custom persistence backend would be hit very badly by this change and if we had implemented the JdbcBasePersistenceImpl more efficiently this would've been a big regression. We should fix the JDBC persistence forward to be efficient.

The TL;DR: The index used for listing by name isn't generally a fully-covering index for any known Polaris persistence impls, and even if it was, we mustn't load full entities from the database all the time just to discard most columns in the "pure name-listing" use cases like in Iceberg listTables or listNamespaces.

Some archaeology:

Historically TreeMapStore was sloppy to store an entire PolarisBaseEntity in entitiesActive even though it was representing storing only an EntityNameLookupRecord in entitiesActive; originally this was "protected" from the outside world because all entitiesActive interactions only ever produced EntityNameLookupRecord or equivalent, but then the addition of the 6-arg version with entityFilter and the 7-arg version with transformer incorrectly assumed that this means it's equally cheap to produce the entire entity.

This is analogous to whether we use the INCLUDE keyword in Postgres when creating an index to make it a COVERING INDEX (https://www.postgresql.org/docs/current/indexes-index-only-scans.html#INDEXES-INDEX-ONLY-SCANS) or use the STORING keyword in Google Spanner: https://cloud.google.com/blog/products/databases/how-to-use-a-storing-clause-in-cloud-spanner/

For Postgres here it looks like we let the UNIQUE CONSTRAINT automatically create a secondary index under the hoot: https://github.com/singhpk234/polaris/blob/6cf26f8e5213802161fde6bd1e190d6fdc345822/scripts/postgres/schema-v1-postgresql.sql#L44

The Postgres docs say:

the uniqueness condition applies to just column x, not to the combination of x and y. (An INCLUDE clause can also be written in UNIQUE and PRIMARY KEY constraints, providing alternative syntax for setting up an index like this.)

Which implies by negation that if you don't have an INCLUDE clause on your UNIQUE constraint then the UNIQUE index would not automatically be a covering index.

Ultimately we'll need at least 2 variations of the method (one for pure-name-listing, another for the current Filter and Transformer versions), but we probably actually want to do a better job of figuring out what fields we actually needed in either the transformer or filter versions.

Maybe we'd have a catch-all that indeed forces a base join from the secondary index to retrieve the full PolarisEntity, but it's unclear whether the main use cases today actually need all the fields.

parentId,
entityType,
entityFilter,
entity ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so we shouldn't have been lazy with properly expressing the distinctions in the behaviors of the different methods here in TreeMapTransactionalPersistenceImpl because it hid the real behavioral nuance since inmemory operations are too cheap.

In general the entitesActive is not a "covering index" for the entire PolarisBaseEntity, so operations that require the whole entity are not in the same category as those which only require EntityNameLookupRecord.

parentId,
entityType,
entityFilter,
EntityNameLookupRecord::new,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an unfortunate inefficiency that accidentally slipped through probably because the TreeMap impl gets away with it due to everything being in-memory. The "basic" version should never have delegated to the "filter" or "transformer" version, and indeed even the filter version probably shouldn't have just delegated to the "transformer" version either.

In practice, at the very least the "basic" version (5-arg version) should have had its own underlying query which only retrieved the columns used in EntityNameLookupRecord.

It's a pretty huge inefficiency (borderline bug) right now for "pure name listing" to have to select all columns from the database only to throw away all the big columns.

@github-project-automation github-project-automation bot moved this from Ready to merge to PRs In Progress in Basic Kanban Board Aug 8, 2025
@XN137
Copy link
Contributor Author

XN137 commented Aug 11, 2025

@dennishuo thanks a lot of the detailed reply!

BasePersistence.listEntities has 3 variants:

Page<EntityNameLookupRecord> listEntities(..., PageToken);

Page<EntityNameLookupRecord> listEntities(..., Predicate<PolarisBaseEntity>, PageToken)

<T> Page<T> listEntities(..., Predicate<PolarisBaseEntity>, Function<PolarisBaseEntity, T>, PageToken);

I understand your point that method 1 was supposed to be used by "name-listing" polaris code to only retrieve the entity properties required for building a EntityNameLookupRecord (i.e. most prominently listTables or listNamespaces).
while method 3 needs to load the full entity properties to support the predicate and transformer parameters.

so method 1 can be faster and cheaper based on the persistence implementation.

as your comments also pointed out method 2 is weird as it supports a Predicate<PolarisBaseEntity> which requires loading the full entity, but then throws most of the loaded entity properties away to only return a EntityNameLookupRecord.
so any performance benefit of the latter are lost due to the former.

due to the above fact, in our current codebase method 2 is frequently implemented as forwarding to method 3.

also on current main method 1 has no callers at all, as listTables for example needs to pass in a PolarisEntitySubType subType filter and is thus using method 2:

private Page<TableIdentifier> listTableLike(
PolarisEntitySubType subType, Namespace namespace, PageToken pageToken) {
PolarisResolvedPathWrapper resolvedEntities = resolvedEntityView.getResolvedPath(namespace);
if (resolvedEntities == null) {
// Illegal state because the namespace should've already been in the static resolution set.
throw new IllegalStateException(
String.format("Failed to fetch resolved namespace '%s'", namespace));
}
List<PolarisEntity> catalogPath = resolvedEntities.getRawFullPath();
ListEntitiesResult listResult =
getMetaStoreManager()
.listEntities(
getCurrentPolarisContext(),
PolarisEntity.toCoreList(catalogPath),
PolarisEntityType.TABLE_LIKE,
subType,
pageToken);

I am guessing this changed with the introduction of GenericTables and/or the rework of pagination.

this also means that even if a private PolarisMetaStoreManager impl is using method 1, its a bit of a private implementation detail as no general polaris code is calling method 1 directly.

the above is an explanation as to how I arrived at this PR suggesting the removal of method 1 (which is unused) and method 2 (which is forwarding to method 3).

As for a proper way forward that keeps the performance benefit in mind I have put up 2 PRs:

#2317
allows the removal of method 2 by letting the other methods support a PolarisEntitySubType subType parameter that can even be pushed down into the queries.

#2290
removes a common anti-pattern in the codebase where callers do listEntites - stream - loadEntity which also defeats any performance optimizations that listEntites might have.

I would appreciate it, if you took the time to look at these PRs as they should address a lot of the shortcomings identified as part of this discussion.
After receiving your feedback we can just close this PR (as it's mostly replaced by the first PR mentioned above).

@dennishuo
Copy link
Contributor

@XN137 Good digging! I didn't realize that method 1 callsites had regressed. I did a bit more archaeology and it looks like #1938 was actually what introduced the regression for the main listTables/listNamespaces behaviors:

Previously AtomicMetaStoreManager (and TransactionalMetaStoreManagerImpl):

// prune the returned list with only entities matching the entity subtype
if (entitySubType != PolarisEntitySubType.ANY_SUBTYPE) {
  resultPage =
      pageToken.buildNextPage(
          resultPage.items.stream()
              .filter(rec -> rec.getSubTypeCode() == entitySubType.getCode())
              .collect(Collectors.toList()));
}

After:

// prune the returned list with only entities matching the entity subtype
Predicate<PolarisBaseEntity> filter =
    entitySubType != PolarisEntitySubType.ANY_SUBTYPE
        ? e -> e.getSubTypeCode() == entitySubType.getCode()
        : entity -> true;

Page<EntityNameLookupRecord> resultPage =
    ms.listEntities(callCtx, catalogId, parentId, entityType, filter, pageToken);

Fortunately this looks like this never made it into 1.0.x:

That change would've been somewhat worse than the more-noticeable removal of the overloads for anyone using the standard AtomicMetaStoreManager or TransactionalMetaStoreManagerImpl with a private implementation of BasePersistence/TransactionalPersistence who may have come to depend strongly on the differentiation in performance between the different overloads, since unfortunately #1938 would silently break this behavioral expectation.

The very original introduction of both filter and transformer was honestly probably too rushed as a short-term crutch, and was exclusively for the loadTasks which required very open-ended post-processing filtering.

Overall I think your other two PRs are the right direction. The various listEntities...map(loadEntity...) flows are indeed one of the big known pain points no one got around to fixing yet so that's actually great to see!

I'll comment on specifics within each PR.

@XN137
Copy link
Contributor Author

XN137 commented Aug 13, 2025

as discussed, closing this ticket as https://github.com/apache/polaris/pull/2317 is mostly the replacement.

@XN137 XN137 closed this Aug 13, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board Aug 13, 2025
@XN137 XN137 deleted the Simplify-listEntities branch August 13, 2025 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants