-
Notifications
You must be signed in to change notification settings - Fork 290
Remove overloads of BasePersistence.listEntities #2262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5944cb1
to
314b3dd
Compare
`BasePersistence.listEntities` and `TransactionalPersistence.listEntitiesInCurrentTxn` are afaict only overloaded to make the `entityFilter` and `transformer` parameters "optional" for some callers. the most central method still has to have support for both of them. using default methods in the interface would enable this more clearly but as it turns out, hardly any callers are utilizing those overloads and thus if we simply remove the overloads we have less code and things become clearer across the board. we also add a `EntityNameLookupRecord.fromEntity` factory method to replace the overloaded constructor.
314b3dd
to
b3a06bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks straightforward
+1
* @return the list of entities for the specified list operation | ||
*/ | ||
@Nonnull | ||
Page<EntityNameLookupRecord> listEntities( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest a dev ML discussion or vote for interface change in this class. cc @dennishuo cc @singhpk234
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular type is actually not even used by any production code outside of a particular persistence implementation.
If a particular implementation needs more functions, those implementations are free to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you help me understand what particular concern one might have?
the innermost methods that support all the parameters are still there (so not that hard to switch to if backward-compatibility is the concern)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because BasePersistence
is a core persistence interface, any change—no matter how small—deserves a note on the dev mailing list so everyone stays in the loop.
this.typeCode = entity.getTypeCode(); | ||
this.name = entity.getName(); | ||
this.subTypeCode = entity.getSubTypeCode(); | ||
public static EntityNameLookupRecord fromEntity(PolarisBaseEntity entity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the motivation for this change here? if no particular reason, can we keep the original constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine to add a new static method like this. I'd suggest to deprecate the constructor first and remove it later if we don't need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss on mailing list. At least one big service provider with a custom persistence backend would be hit very badly by this change and if we had implemented the JdbcBasePersistenceImpl more efficiently this would've been a big regression. We should fix the JDBC persistence forward to be efficient.
The TL;DR: The index used for listing by name isn't generally a fully-covering index for any known Polaris persistence impls, and even if it was, we mustn't load full entities from the database all the time just to discard most columns in the "pure name-listing" use cases like in Iceberg listTables
or listNamespaces
.
Some archaeology:
Historically TreeMapStore
was sloppy to store an entire PolarisBaseEntity
in entitiesActive
even though it was representing storing only an EntityNameLookupRecord
in entitiesActive
; originally this was "protected" from the outside world because all entitiesActive
interactions only ever produced EntityNameLookupRecord
or equivalent, but then the addition of the 6-arg version with entityFilter
and the 7-arg version with transformer
incorrectly assumed that this means it's equally cheap to produce the entire entity.
This is analogous to whether we use the INCLUDE
keyword in Postgres when creating an index to make it a COVERING INDEX
(https://www.postgresql.org/docs/current/indexes-index-only-scans.html#INDEXES-INDEX-ONLY-SCANS) or use the STORING
keyword in Google Spanner: https://cloud.google.com/blog/products/databases/how-to-use-a-storing-clause-in-cloud-spanner/
For Postgres here it looks like we let the UNIQUE CONSTRAINT
automatically create a secondary index under the hoot: https://github.com/singhpk234/polaris/blob/6cf26f8e5213802161fde6bd1e190d6fdc345822/scripts/postgres/schema-v1-postgresql.sql#L44
The Postgres docs say:
the uniqueness condition applies to just column x, not to the combination of x and y. (An INCLUDE clause can also be written in UNIQUE and PRIMARY KEY constraints, providing alternative syntax for setting up an index like this.)
Which implies by negation that if you don't have an INCLUDE
clause on your UNIQUE constraint then the UNIQUE index would not automatically be a covering index.
Ultimately we'll need at least 2 variations of the method (one for pure-name-listing, another for the current Filter and Transformer versions), but we probably actually want to do a better job of figuring out what fields we actually needed in either the transformer or filter versions.
Maybe we'd have a catch-all that indeed forces a base join from the secondary index to retrieve the full PolarisEntity, but it's unclear whether the main use cases today actually need all the fields.
parentId, | ||
entityType, | ||
entityFilter, | ||
entity -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so we shouldn't have been lazy with properly expressing the distinctions in the behaviors of the different methods here in TreeMapTransactionalPersistenceImpl
because it hid the real behavioral nuance since inmemory operations are too cheap.
In general the entitesActive
is not a "covering index" for the entire PolarisBaseEntity, so operations that require the whole entity are not in the same category as those which only require EntityNameLookupRecord.
parentId, | ||
entityType, | ||
entityFilter, | ||
EntityNameLookupRecord::new, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an unfortunate inefficiency that accidentally slipped through probably because the TreeMap impl gets away with it due to everything being in-memory. The "basic" version should never have delegated to the "filter" or "transformer" version, and indeed even the filter version probably shouldn't have just delegated to the "transformer" version either.
In practice, at the very least the "basic" version (5-arg version) should have had its own underlying query which only retrieved the columns used in EntityNameLookupRecord
.
It's a pretty huge inefficiency (borderline bug) right now for "pure name listing" to have to select all columns from the database only to throw away all the big columns.
@dennishuo thanks a lot of the detailed reply!
I understand your point that method 1 was supposed to be used by "name-listing" polaris code to only retrieve the entity properties required for building a so method 1 can be faster and cheaper based on the persistence implementation. as your comments also pointed out method 2 is weird as it supports a due to the above fact, in our current codebase method 2 is frequently implemented as forwarding to method 3. also on current polaris/runtime/service/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java Lines 2574 to 2591 in 2f985ab
I am guessing this changed with the introduction of GenericTables and/or the rework of pagination. this also means that even if a private the above is an explanation as to how I arrived at this PR suggesting the removal of method 1 (which is unused) and method 2 (which is forwarding to method 3). As for a proper way forward that keeps the performance benefit in mind I have put up 2 PRs: #2317 #2290 I would appreciate it, if you took the time to look at these PRs as they should address a lot of the shortcomings identified as part of this discussion. |
@XN137 Good digging! I didn't realize that method 1 callsites had regressed. I did a bit more archaeology and it looks like #1938 was actually what introduced the regression for the main Previously AtomicMetaStoreManager (and TransactionalMetaStoreManagerImpl):
After:
Fortunately this looks like this never made it into 1.0.x: Line 713 in d7e28e4
That change would've been somewhat worse than the more-noticeable removal of the overloads for anyone using the standard The very original introduction of both Overall I think your other two PRs are the right direction. The various I'll comment on specifics within each PR. |
as discussed, closing this ticket as |
BasePersistence.listEntities
andTransactionalPersistence.listEntitiesInCurrentTxn
are afaict only overloaded to make the
entityFilter
andtransformer
parameters "optional" for some callers.
the most central method still has to have support for both of them.
using default methods in the interface would enable this more clearly
but as it turns out, hardly any callers are utilizing those overloads
and thus if we simply remove the overloads we have less code and things
become clearer across the board.
we also add a
EntityNameLookupRecord.fromEntity
factory method toreplace the overloaded constructor.