Optimize Table Reflection Performance for Large Schemas #619
Problem Statement
The current implementation of get_columns() in the Snowflake SQLAlchemy dialect has a performance issue when working with schemas containing thousands of tables. The dialect unconditionally attempts to prefetch and cache column metadata for all tables in the schema via information_schema.columns, even when only reflecting a single table. The single-table fallback (using DESC TABLE) only triggers after the expensive schema-wide query fails, which means users experience slow performance (or timeouts) before the fallback even kicks in.
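A rough sketch of the scenario (connection parameters and names below are placeholders, not taken from this PR): even a single-table reflection call like the following previously paid the cost of the schema-wide prefetch.

```python
from sqlalchemy import create_engine, inspect
from snowflake.sqlalchemy import URL

# Placeholder credentials; any schema with thousands of tables reproduces the issue.
engine = create_engine(URL(
    account="my_account",
    user="my_user",
    password="...",
    database="my_db",
    schema="my_schema",
    warehouse="my_wh",
))

inspector = inspect(engine)
# Reflecting one table previously triggered a query against
# information_schema.columns covering every table in my_schema.
columns = inspector.get_columns("one_table", schema="my_schema")
```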
Solution
This PR changes the default behavior to query individual tables directly, with opt-in schema-wide caching through the cache_column_metadata connection parameter:

- When cache_column_metadata=False (new default): each table is reflected individually with DESC TABLE, and the schema-wide information_schema.columns query is skipped entirely.
- When cache_column_metadata=True (opt-in): column metadata for all tables in the schema is prefetched and cached via information_schema.columns, as before.
Changes
- Updated the get_columns() method to check the _cache_column_metadata flag before calling _get_schema_columns() (see the sketch below).
- _StructuredTypeInfoManager.get_table_columns(): changed from denormalize_name() to normalize_name() for schema and table names to properly match Snowflake's identifier handling.
- Use name_utils.normalize_name() for consistent identifier quoting.
- Normalized handling of None in both code paths (DESC TABLE vs information_schema) to ensure identical column metadata regardless of the caching setting.
- Renamed the order_type key to order to match the required argument name of the sqlalchemy.sql.schema.Identity constructor.
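A minimal sketch of the control flow described in the first item above (not the PR's actual code; _get_table_columns and the exact signatures are assumptions):

```python
def get_columns(self, connection, table_name, schema=None, **kw):
    """Reflect columns for one table, optionally via the schema-wide cache."""
    schema = schema or self.default_schema_name

    if self._cache_column_metadata:
        # Opt-in path: prefetch and cache column metadata for every table
        # in the schema from information_schema.columns.
        schema_columns = self._get_schema_columns(connection, schema, **kw)
        if schema_columns is not None:
            return schema_columns[self.normalize_name(table_name)]

    # Default path (and fallback when the schema-wide query fails or is
    # disabled): describe only the requested table with DESC TABLE.
    return self._get_table_columns(connection, table_name, schema, **kw)
```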
Performance Impact
For schemas with a large number of tables/columns, this change dramatically improves table reflection performance when reflecting a single table, as it eliminates the costly (and often failing) query that fetches metadata for all tables.
Before (schema-wide caching always attempted): reflecting even a single table first ran the information_schema.columns query over the entire schema.
After (cache_column_metadata=False, the new default): reflecting a single table issues only a DESC TABLE statement for that table.
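As an illustrative way to compare the two modes against a real schema (reusing the engine from the earlier sketch; no timings are claimed here), one can time a single-table reflection under each setting:

```python
import time

from sqlalchemy import inspect

start = time.perf_counter()
inspect(engine).get_columns("one_table", schema="my_schema")
print(f"single-table reflection took {time.perf_counter() - start:.2f}s")
```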
Backward Compatibility
- cache_column_metadata=False is the default (matching the original implementation).
- The previous schema-wide caching behavior remains available by opting in with cache_column_metadata=True.
- When cache_column_metadata=True, the fallback mechanism still works for schemas that are too large.

Usage
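A usage sketch (connection parameters below are placeholders): the flag is passed like any other connection parameter, for example via the snowflake.sqlalchemy URL helper.

```python
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# Opt back in to schema-wide column caching; omit the parameter (or pass
# False) to keep the new per-table DESC TABLE behavior.
engine = create_engine(URL(
    account="my_account",
    user="my_user",
    password="...",
    database="my_db",
    schema="my_schema",
    warehouse="my_wh",
    cache_column_metadata=True,
))
```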
Testing
Added test_cache_column_metadata.py with parameterized tests that verify the reflection behavior under both settings of the flag.
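As a loose illustration only (not the PR's actual test code; connection parameters, table names, and assertions are assumptions), a parameterized test over the flag could look like this:

```python
import pytest
from sqlalchemy import create_engine, inspect
from snowflake.sqlalchemy import URL


@pytest.mark.parametrize("cache_column_metadata", [False, True])
def test_get_columns_is_consistent_across_modes(cache_column_metadata):
    # Placeholder credentials; a real suite would read these from a fixture
    # or environment variables.
    engine = create_engine(URL(
        account="my_account",
        user="my_user",
        password="...",
        database="my_db",
        schema="my_schema",
        warehouse="my_wh",
        cache_column_metadata=cache_column_metadata,
    ))
    try:
        columns = inspect(engine).get_columns("one_table", schema="my_schema")
        # Both code paths (DESC TABLE and information_schema.columns)
        # should report the same, non-empty column metadata.
        assert [c["name"] for c in columns]
    finally:
        engine.dispose()
```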
Related Issue
Addresses Snowflake Support Case #01168351 regarding performance issues with table reflection in large schemas.