Conversation

@ranophoenix ranophoenix commented Nov 13, 2025

Optimize Table Reflection Performance for Large Schemas

Problem Statement

The current implementation of get_columns() in the Snowflake SQLAlchemy dialect has a performance issue when working with schemas containing thousands of tables. The dialect unconditionally attempts to prefetch and cache column metadata for all tables in the schema via information_schema.columns, even when only reflecting a single table. This causes:

  • Significant delays when reflecting individual tables in large schemas
  • Unnecessary network overhead and query execution time
  • Query failures when schemas are extremely large (error 90030: "Information schema query returned too much data")
  • Wasted processing time executing expensive queries that ultimately fail and fall back to the granular approach anyway
  • Poor user experience for common use cases (reflecting one or a few tables)

The single-table query fallback (using DESC TABLE) only triggers after the expensive schema-wide query fails, which means users experience slow performance (or timeouts) before the fallback even kicks in.
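
For illustration, the pre-change flow behaves roughly like the sketch below. The helper names and exception type are placeholders invented for this description, not the dialect's actual internals:

# Illustrative sketch of the pre-change reflection flow; the names below are
# placeholders, not snowflake-sqlalchemy internals.

TOO_MUCH_DATA = 90030  # "Information schema query returned too much data"

class InformationSchemaTooLarge(Exception):
    """Stand-in for the failure raised when the schema-wide query is too large."""

def fetch_schema_columns(schema):
    # Stand-in for the schema-wide information_schema.columns prefetch.
    raise InformationSchemaTooLarge(TOO_MUCH_DATA)

def desc_table(schema, table):
    # Stand-in for the per-table DESC TABLE query.
    return [{'name': 'ID', 'type': 'NUMBER'}]

def get_columns_old(schema, table):
    try:
        # Old flow: always attempt the expensive schema-wide prefetch first,
        # even when only one table is being reflected.
        return fetch_schema_columns(schema)[table]
    except InformationSchemaTooLarge:
        # The cheap per-table fallback only runs after the expensive query
        # has already been executed and failed.
        return desc_table(schema, table)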

Solution

This PR changes the default behavior to query individual tables directly, with opt-in schema-wide caching through the cache_column_metadata connection parameter (a rough sketch of the new dispatch follows the two lists below):

When cache_column_metadata=False (new default):

  • Queries only the specific requested table using DESC TABLE
  • Avoids the expensive information_schema.columns query entirely
  • Dramatically improves performance for single-table reflection in large schemas
  • No risk of hitting the "too much data" error

When cache_column_metadata=True (opt-in):

  • Attempts to prefetch all schema columns via information_schema.columns
  • Only beneficial for small-to-medium schemas where the query succeeds
  • Can help amortize costs when reflecting multiple tables sequentially
  • Will still fall back to individual queries if the schema is too large (error 90030)
  • Not recommended for large schemas as it wastes time on a query that will ultimately fail
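
A rough sketch of the new dispatch, reusing the placeholder helpers from the sketch in the problem statement above (again, not the real implementation):

def get_columns_new(schema, table, cache_column_metadata=False):
    if cache_column_metadata:
        try:
            # Opt-in: attempt the schema-wide prefetch, falling back if it fails.
            return fetch_schema_columns(schema)[table]
        except InformationSchemaTooLarge:
            pass
    # New default: query only the requested table via DESC TABLE.
    return desc_table(schema, table)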

Changes

  1. Modified get_columns() method to check _cache_column_metadata flag before calling _get_schema_columns()
  2. Changed default behavior from attempting schema-wide caching to querying individual tables
  3. Fixed identifier normalization bug in _StructuredTypeInfoManager.get_table_columns() - changed from denormalize_name() to normalize_name() for schema and table names to properly match Snowflake's identifier handling
  4. Fixed name_utils.normalize_name() for consistent identifier quoting
  5. Fixed BINARY type metadata inconsistency - normalized BINARY column length to None in both code paths (DESC TABLE vs information_schema) to ensure identical column metadata regardless of caching setting
  6. Fixed identity column metadata - changed the order_type key to order to match the required keyword argument name of the sqlalchemy.sql.schema.Identity constructor (see the example after this list)
  7. Added comprehensive test coverage with parameterized tests verifying both behaviors
  8. Removed outdated deprecation notice from README as the flag is now functional with proper behavior control
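
For fix 6, the reflected identity metadata must use keyword names that SQLAlchemy's Identity constructor accepts; the values below are illustrative only:

from sqlalchemy.sql.schema import Identity

# "order" is an accepted Identity keyword argument; "order_type" is not, so the
# reflected identity dictionary has to use the "order" key.
identity_kwargs = {'start': 1, 'increment': 1, 'order': True}  # illustrative values
identity = Identity(**identity_kwargs)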

Performance Impact

For schemas with a large number of tables/columns, this change dramatically improves table reflection performance when reflecting a single table, as it eliminates the costly (and often failing) query that fetches metadata for all tables.

Before (schema-wide caching always attempted):

  • Attempt to query all columns in schema → wait → potentially fail with error 90030 → fall back to DESC TABLE
  • Result: Significant delays or timeouts

After (cache_column_metadata=False, new default):

  • Query only the requested table with DESC TABLE
  • Result: Dramatically faster reflection times

Backward Compatibility

  • Restores original default behavior - Returns to cache_column_metadata=False as the default (matching the original implementation)
  • ⚠️ Change from recent versions - Recent versions effectively forced schema-wide caching (the deprecated behavior); this PR makes it opt-in again
  • Opt-in to schema caching - Users can enable schema-wide caching by setting cache_column_metadata=True
  • Automatic fallback preserved - Even with cache_column_metadata=True, the fallback mechanism still works for schemas that are too large
  • Bug fixes improve reliability - The identifier normalization and BINARY type fixes ensure consistent behavior across both caching modes

Usage

# New default behavior (recommended for most cases)
engine = create_engine('snowflake://...')  # cache_column_metadata=False by default

# Opt-in to schema-wide caching (only for small-to-medium schemas)
engine = create_engine('snowflake://...?cache_column_metadata=true')
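
For example, reflecting a single table with the default setting uses the standard SQLAlchemy inspector; the URL below is a placeholder for real credentials:

from sqlalchemy import create_engine, inspect

# Placeholder URL; substitute your own account, credentials, database, and schema.
engine = create_engine('snowflake://user:pass@account/db/schema')

# With cache_column_metadata left at its default (False), this issues a single
# DESC TABLE for the requested table instead of a schema-wide prefetch.
inspector = inspect(engine)
columns = inspector.get_columns('my_table', schema='my_schema')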

Testing

Added test_cache_column_metadata.py with parameterized tests that verify the following (a minimal sketch of the parameterization pattern appears after this list):

  • With cache_column_metadata=False: Only 1 DESC call for the requested table (optimal)
  • With cache_column_metadata=True: 1 schema query + additional DESC calls for structured types
  • Proper handling of OBJECT, ARRAY, and MAP column types in both modes
  • Consistent column metadata regardless of caching mode
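
A minimal sketch of the parameterization pattern (not the actual contents of test_cache_column_metadata.py; the helper below merely stands in for reflection against a real engine):

import pytest

def reflect_single_table(cache_column_metadata):
    # Placeholder for reflecting one table against a real engine; it only
    # reports which query shapes would be issued under each setting.
    if cache_column_metadata:
        return ['information_schema.columns', 'DESC TABLE']
    return ['DESC TABLE']

@pytest.mark.parametrize('cache_column_metadata', [False, True])
def test_query_shapes(cache_column_metadata):
    queries = reflect_single_table(cache_column_metadata)
    if cache_column_metadata:
        assert 'information_schema.columns' in queries
    else:
        assert queries == ['DESC TABLE']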

Related Issue

Addresses Snowflake Support Case #01168351 regarding performance issues with table reflection in large schemas.

@ranophoenix ranophoenix requested a review from a team as a code owner November 13, 2025 19:49

github-actions bot commented Nov 13, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@ranophoenix (Author)

I have read the CLA Document and I hereby sign the CLA

@ranophoenix ranophoenix force-pushed the support_case_01168351 branch from 6629361 to 4613797 on November 13, 2025 at 20:51
@ranophoenix ranophoenix force-pushed the support_case_01168351 branch from 4613797 to de862cb on November 13, 2025 at 21:20