[Feature] Private PyPI packages in Python UDFs #12655

@sriramr98

Description

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Support Private PyPI packages for Python UDFs

While #12041 adds support for specifying public PyPI packages in Python UDFs via the packages config, many organizations rely on private PyPI repositories (e.g., Artifactory, AWS CodeArtifact, GCP Artifact Registry, Azure Artifacts, Nexus, Gemfury) to host internal Python packages. There is currently no way to configure dbt to authenticate against a private package index when resolving UDF packages.

Additionally, warehouses vary in how they accept custom/private code — Snowflake supports uploading .zip files via stages, BigQuery supports importing .py files from Cloud Storage — but neither supports authenticating against a private PyPI index directly. dbt could bridge this gap by resolving packages from a private index and preparing them in the format each warehouse requires.

Motivation

  • Teams building internal Python libraries (e.g., shared feature engineering, custom ML models, proprietary transforms) cannot use them in dbt Python UDFs today.
  • Warehouse-native package support is limited to public PyPI:
    • Snowflake: Supports public PyPI via ARTIFACT_REPOSITORY + PACKAGES, but docs explicitly state "Access to private repositories is not supported." Custom code can be uploaded as .zip files to Snowflake stages and referenced via IMPORTS.
    • BigQuery: Supports public PyPI via OPTIONS(packages=["..."]) (wheels only). Custom Python files can be imported from GCS via OPTIONS(library=["gs://..."]), but this is limited to .py files, not full package archives.
  • A dbt-native solution would provide a consistent, cross-adapter experience for private package consumption.

Proposed design

1. New package-indexes configuration (project-level)

Allow users to configure one or more private PyPI indexes in dbt_project.yml:

  package-indexes:
    - name: internal
      url: https://pypi.internal.company.com/simple/
      auth:
        type: token  # or "basic", "env"
        token: "{{ env_var('PRIVATE_PYPI_TOKEN') }}"
    - name: aws-codeartifact
      url: https://my-domain-123456789.d.codeartifact.us-east-1.amazonaws.com/pypi/my-repo/simple/
      auth:
        type: basic
        username: "{{ env_var('CODEARTIFACT_USER') }}"
        password: "{{ env_var('CODEARTIFACT_TOKEN') }}"

auth.type supports common authentication patterns: token (a bearer/API token), basic (username + password), and env (fully custom credentials sourced from environment variables, e.g., an AWS_SESSION_TOKEN-based flow).
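To make the auth block concrete, here is a minimal sketch of how dbt could translate an index entry into the credential-embedded index URL that pip-style resolvers expect. The helper name `build_index_url` and the `__token__` username convention (used by PyPI-compatible servers) are illustrative assumptions, not existing dbt API:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def build_index_url(index: dict) -> str:
    """Embed credentials into an index URL the way pip expects
    (https://user:pass@host/simple/). Hypothetical helper, not dbt code."""
    auth = index.get("auth", {})
    parts = urlsplit(index["url"])
    if auth.get("type") == "basic":
        user = quote(auth["username"], safe="")
        password = quote(auth["password"], safe="")
        netloc = f"{user}:{password}@{parts.netloc}"
    elif auth.get("type") == "token":
        # Many PyPI-compatible registries accept the token as the
        # basic-auth password with a fixed username such as "__token__".
        netloc = f"__token__:{quote(auth['token'], safe='')}@{parts.netloc}"
    else:
        # "env" or no auth: leave the URL untouched; credentials come
        # from the environment (e.g., pip's keyring or netrc support).
        return index["url"]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(build_index_url({
    "url": "https://pypi.internal.company.com/simple/",
    "auth": {"type": "token", "token": "abc123"},
}))
```

Embedding credentials in the URL keeps the resolver integration simple, but dbt would still need to redact that URL in logs.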

2. Extended packages config on UDFs/models

Allow package specs to reference a named index:

  functions:
    - name: my_function
      config:
        packages:
          - scikit-learn  # resolved from public PyPI (default)
          - name: my-internal-lib
            version: ">=1.2.0"
            index: internal  # references the named index above
          - name: another-lib
            index: aws-codeartifact

The simple string form (- scikit-learn) remains supported for backward compatibility with public PyPI.
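Supporting both forms means the parser has to normalize them into one internal shape. A sketch of that normalization, assuming a hypothetical `normalize_package_spec` helper (where `index: None` means public PyPI):

```python
def normalize_package_spec(spec):
    """Normalize the two supported spec forms into one dict.
    Hypothetical helper illustrating the proposed schema, not dbt code."""
    if isinstance(spec, str):
        # Bare string form: resolved from public PyPI, no version pin.
        return {"name": spec, "version": None, "index": None}
    return {
        "name": spec["name"],
        "version": spec.get("version"),  # optional version constraint
        "index": spec.get("index"),      # None -> public PyPI (default)
    }

packages = [
    "scikit-learn",
    {"name": "my-internal-lib", "version": ">=1.2.0", "index": "internal"},
]
print([normalize_package_spec(p) for p in packages])
```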

3. Artifact staging configuration (adapter-specific)

Since warehouses require pre-built artifacts to be uploaded to warehouse-specific storage, users need to configure where dbt stages these artifacts and how to authenticate. This would live in profiles.yml alongside existing adapter connection config:

Snowflake:

  my_snowflake_profile:
    target: dev
    outputs:
      dev:
        type: snowflake
        # ... existing connection config ...
        python_package_staging:
          stage: my_db.my_schema.python_packages  # Snowflake stage for .zip uploads

BigQuery:

  my_bq_profile:
    target: dev
    outputs:
      dev:
        type: bigquery
        # ... existing connection config ...
        python_package_staging:
          gcs_bucket: my-project-dbt-packages      # GCS bucket for .py/.whl files
          gcs_prefix: udf-deps/                     # optional prefix/folder
          credentials: "{{ env_var('GCS_SA_KEY_PATH') }}"  # if different from the BQ connection creds

Key design decisions:

  • Staging config lives in profiles.yml, not dbt_project.yml, because it contains environment-specific settings and potentially credentials. This is consistent with how other connection/infrastructure config is handled.
  • Each adapter defines its own staging schema since the storage mechanism is fundamentally different per warehouse.
  • Credentials for artifact storage may differ from the main warehouse connection: e.g., a service account with GCS write access that is separate from the BigQuery query credentials. Adapters should support an optional credential override, falling back to the main connection credentials by default.

End-to-end flow

For a private package on a warehouse that doesn't support private PyPI natively:

  1. Resolve: dbt authenticates against the private index, resolves the package + transitive dependencies.
  2. Download: dbt downloads the resolved artifacts locally.
  3. Transform: dbt converts/repackages artifacts into the format the warehouse requires (.zip for Snowflake, .py for BigQuery, etc.).
  4. Upload: dbt uploads to the configured staging location (Snowflake stage, GCS bucket, DBFS path).
  5. Reference: The adapter generates the CREATE FUNCTION statement referencing the staged artifacts (IMPORTS for Snowflake, library for BigQuery, etc.).

Extras

  1. Since the index and staging configuration both reference credentials, dbt debug should validate the connection to each configured private repository (and, ideally, to the staging location).
  2. We may also want to validate that the requested packages actually exist in the private repository, as a pre-run check.
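For the dbt debug check, a cheap probe is an authenticated GET against the index's simple-API root: a 200 response confirms both reachability and credentials. A sketch of building that request (the helper name `index_probe_request` is hypothetical; only the request is constructed here, not sent):

```python
import base64
import urllib.request

def index_probe_request(index: dict) -> urllib.request.Request:
    """Build the request `dbt debug` could issue to verify that a private
    index is reachable and the credentials are accepted. Sketch only."""
    headers = {"Accept": "text/html"}
    auth = index.get("auth", {})
    if auth.get("type") == "token":
        headers["Authorization"] = f"Bearer {auth['token']}"
    elif auth.get("type") == "basic":
        raw = f"{auth['username']}:{auth['password']}".encode()
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode()
    return urllib.request.Request(index["url"], headers=headers)

req = index_probe_request({
    "url": "https://pypi.internal.company.com/simple/",
    "auth": {"type": "token", "token": "abc123"},
})
print(req.get_header("Authorization"))
```

Sending the request with `urllib.request.urlopen(req)` and checking for a 200 (versus 401/403) would distinguish "unreachable" from "bad credentials" in the debug output.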

Acceptance criteria

  • Users can configure one or more private PyPI indexes in dbt_project.yml with credential references
  • The packages config on functions/models supports specifying which index to resolve from
  • Users can configure warehouse-specific artifact staging locations in profiles.yml
  • Adapters can authenticate against staging storage (with optional credential override)
  • Credentials are never stored in plain text — only env var references
  • Public PyPI remains the default when no index is specified (backward compatible)
  • Documentation covers setup for common private registries (Artifactory, CodeArtifact, GCP Artifact Registry) and staging configuration per warehouse

Implementation considerations

  • dbt-core scope: Package index configuration, credential management, resolution/download logic,
    the extended packages schema, and a standard interface for adapters to receive resolved artifacts.
  • dbt-adapters scope: Warehouse-specific artifact transformation, upload to staging storage, and CREATE FUNCTION statement generation. See dbt-adapters#1651 for the adapter-side discussion.
  • Dependency resolution: When downloading for artifact-based delivery, dbt would need to resolve transitive dependencies. We should consider leveraging pip download or a library like resolvelib rather than reimplementing resolution.
  • Caching: Downloaded and transformed artifacts should be cached locally to avoid re-downloading
    on every run. We could consider a content-addressed cache keyed on package name + version + platform.
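The content-addressed cache idea from the last bullet can be sketched in a few lines: hash the three fields that determine an artifact's bytes, and use the digest as the cache directory name. The helper name is hypothetical:

```python
import hashlib

def artifact_cache_key(name: str, version: str, platform: str) -> str:
    """Content-address a resolved artifact by the fields that determine
    its bytes: package name (case-insensitive per PEP 503), exact
    version, and target platform. Illustrative sketch, not dbt code."""
    raw = f"{name.lower()}-{version}-{platform}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]

key = artifact_cache_key("my-internal-lib", "1.2.0", "manylinux2014_x86_64")
print(key)  # stable across runs, so repeated `dbt run`s can reuse the artifact
```

A stable key means the transform/upload steps can be skipped whenever the staged artifact for that key already exists.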

Describe alternatives you've considered

All of this can be done manually today (building, uploading, and referencing artifacts by hand), and it works. However, a standard, cross-adapter interface provided by dbt would be a significant ease-of-use improvement.

Who will this benefit?

Anyone who needs to use private PyPI packages in their Python UDFs.

Are you interested in contributing this feature?

YES

Anything else?

No response
