feat: add ADC authentication#72

Open
yuriolive wants to merge 9 commits into MeltanoLabs:main from yuriolive:main
@yuriolive yuriolive commented Apr 25, 2026

This pull request improves the authentication flexibility and metadata discovery reliability of tap-bigquery. The primary focus is aligning credential resolution with standard Google Cloud patterns (Application Default Credentials) and fixing issues where table names were being incorrectly prefixed during catalog discovery.


Key Changes

1. Robust Credential Resolution (ADC Support)

Introduced a centralized _get_bigquery_client utility and updated BigQueryConnector.create_engine to follow a standard resolution order. This allows the tap to work seamlessly in environments like GKE, Cloud Run, or Airflow using Workload Identity without requiring explicit configuration.

Resolution Order:

  1. Config: google_application_credentials (can be a JSON string, a Python dict, or a file path).
  2. Environment Variable: GOOGLE_APPLICATION_CREDENTIALS_STRING (JSON string).
  3. Environment Variable: GOOGLE_APPLICATION_CREDENTIALS (Path to file).
  4. Implicit ADC: Fallback to environment-based Application Default Credentials.
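The resolution order above can be sketched as a small pure function. This is a minimal illustration, not the code from this PR: `resolve_credentials` and `_parse_json_object` are hypothetical names, and the `("info" | "path" | "adc", value)` return shape is an assumption made so the logic can be shown without importing the Google client libraries.

```python
import json
import os
from typing import Any, Mapping, Optional, Tuple


def _parse_json_object(value: str) -> Optional[dict]:
    """Return the parsed dict if value is a JSON object, otherwise None."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return None
    return parsed if isinstance(parsed, dict) else None


def resolve_credentials(
    config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Any]:
    """Hypothetical helper: decide which credential source applies."""
    env = os.environ if env is None else env

    # 1. Config value: a dict, a JSON string, or a file path.
    configured = config.get("google_application_credentials")
    if isinstance(configured, dict):
        return ("info", configured)
    if isinstance(configured, str) and configured.strip():
        info = _parse_json_object(configured)
        # A string that is not a JSON object is treated as a file path.
        return ("info", info) if info is not None else ("path", configured)

    # 2. Environment variable holding the service account JSON itself.
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS_STRING")
    if raw:
        info = _parse_json_object(raw)
        if info is not None:
            return ("info", info)

    # 3. Environment variable holding a path to a key file.
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path:
        return ("path", path)

    # 4. Nothing explicit was found: fall back to implicit ADC.
    return ("adc", None)
```

A caller could then map `"info"` to `service_account.Credentials.from_service_account_info`, `"path"` to `from_service_account_file`, and `"adc"` to `google.auth.default()`; whether the PR wires it up exactly this way is not shown in the description.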

2. Improved Table Discovery & Normalization

  • Per-table Reflection: Replaced bulk reflection with a one-by-one table discovery approach in discover_catalog_entries. This prevents the SQLAlchemy dialect from incorrectly resolving dataset names as project IDs during bulk operations.
  • Name Normalization: Added _normalize_table_name to consistently strip dataset/schema prefixes from table names. This ensures that tap_stream_id and table identifiers remain clean and consistent.
  • Default Selection: Discovered streams are now explicitly marked as selected-by-default in the metadata to improve the "out-of-the-box" experience for users.
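As a rough illustration of the name normalization described above, the sketch below strips one leading dataset prefix from a table name. The function name and signature are assumptions for this example; the PR's `_normalize_table_name` may take different arguments or handle more cases.

```python
def normalize_table_name(schema_name: str, table_name: str) -> str:
    """Hypothetical sketch: drop a leading '<schema>.' prefix from a table name."""
    prefix = f"{schema_name}."
    # Only strip when something remains after the prefix.
    if table_name.startswith(prefix) and len(table_name) > len(prefix):
        return table_name[len(prefix):]
    return table_name
```

With this sketch, `normalize_table_name("my_dataset", "my_dataset.my_table")` yields `my_table`, while a name without the prefix, or with a different dataset prefix, is returned unchanged.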

3. Schema & Configuration Updates

  • Optional Credentials: The google_application_credentials config property is no longer marked as required=True, as the tap can now rely on environment variables or ADC.
  • Documentation: Updated README.md to clearly outline the new authentication flow.

Technical Details

  • tap_bigquery/client.py: Refactored the client creation logic into a standalone function used by the stream's cached property. Added support for passing project_id to the BigQuery client to ensure correct billing and scoping.
  • tap_bigquery/connector.py:
    • Updated create_engine to support credentials_path for file-based auth via SQLAlchemy.
    • Implemented discover_catalog_entries and _ensure_selected_by_default.
  • tap_bigquery/tap.py: Updated JSON schema for google_application_credentials to reflect its optional status and added a more descriptive explanation for users.
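For readers unfamiliar with Singer metadata, the following sketch shows one way stream-level selected-by-default metadata could be enforced on a plain metadata list. It assumes the standard Singer layout, where the stream-level entry is the one with an empty breadcrumb; the function name and the use of plain dicts instead of the SDK's catalog classes are simplifications, not the PR's actual `_ensure_selected_by_default`.

```python
from typing import Any, Dict, List


def ensure_selected_by_default(
    metadata: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: mark the stream-level entry as selected by default."""
    for entry in metadata:
        # The stream-level entry is the one with an empty breadcrumb.
        if entry.get("breadcrumb") in ([], ()):
            entry.setdefault("metadata", {})["selected-by-default"] = True
            return metadata

    # No stream-level entry existed yet, so add one.
    metadata.append(
        {"breadcrumb": [], "metadata": {"selected-by-default": True}}
    )
    return metadata
```

Property-level entries (those with a non-empty breadcrumb such as `["properties", "id"]`) are left untouched in this sketch.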

How to Test

  1. ADC Test: Run the tap in an environment with gcloud auth application-default login active and no google_application_credentials in the config.
  2. String Config Test: Provide the service account JSON directly as a string in the config.json.
  3. Discovery Test: Run tap-bigquery --discover and verify that tap_stream_id values do not contain redundant dataset prefixes (e.g., a table reported as my_dataset.my_table should appear as my_table when the stream ID already carries the schema prefix).
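The discovery check in step 3 could be partly automated with a small script over the catalog JSON. This is a sketch under assumptions: it relies on the standard Singer catalog shape (a top-level `streams` list whose items have a `tap_stream_id`), and it treats any `.` in a stream ID as a sign of a leftover dataset prefix, which is a heuristic chosen for this example rather than a rule stated in the PR.

```python
import json
from typing import Any, Dict, List


def find_prefixed_stream_ids(catalog: Dict[str, Any]) -> List[str]:
    """Hypothetical check: return stream IDs that still contain a '.'."""
    return [
        stream["tap_stream_id"]
        for stream in catalog.get("streams", [])
        if "." in stream.get("tap_stream_id", "")
    ]


def check_catalog_text(catalog_text: str) -> List[str]:
    """Parse discovery output (a JSON string) and run the check on it."""
    return find_prefixed_stream_ids(json.loads(catalog_text))
```

One way to use it would be to redirect the output of `tap-bigquery --discover` to a file, read that file, and pass its contents to `check_catalog_text`; an empty list suggests no stream ID kept a dotted prefix.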

yuriolive and others added 9 commits February 9, 2026 11:40
- Make google_application_credentials optional in tap config
- Resolve credentials in order: config, GOOGLE_APPLICATION_CREDENTIALS_STRING,
  GOOGLE_APPLICATION_CREDENTIALS path, then ADC (workload identity/GKE/Airflow)
- Add _get_bigquery_client() in client.py with env and default() fallback
- Connector create_engine: check env vars when config has no credentials
- Document credential resolution and ADC in README

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add return type annotation to _get_bigquery_client()
- Add debug logging when JSON parsing fails (instead of silent pass)
- Change connector log level from warning to debug for normal fallback
- Document why json_serializer/deserializer are not needed

Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add Application Default Credentials (ADC) support
…talog discovery

- Introduced a static method `_normalize_table_name` to strip duplicated schema prefixes from table names.
- Updated `discover_catalog_entries` to utilize the new normalization method for consistent table name handling.
- Adjusted `get_object_names` to call the normalization method, ensuring uniformity in table name representation.
- Added capabilities to the `TapBigQuery` class for improved functionality.

Co-authored-by: Cursor <cursoragent@cursor.com>
feat: enhance BigQuery connector with table name normalization and ca…
…y in BigQueryConnector. Added a static method to strip schema prefixes from table names and updated the discover_catalog_entries method to utilize this normalization for consistent table name handling.
- Added a static method `_ensure_selected_by_default` to enforce stream-level selected-by-default metadata on discovered entries.
- Updated `discover_catalog_entries` to include logging of discovered streams and ensure default selection for new entries.