Description
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Support Private PyPI packages for Python UDFs
While #12041 adds support for specifying public PyPI packages in Python UDFs via the packages config, many organizations rely on private PyPI repositories (e.g., Artifactory, AWS CodeArtifact, GCP Artifact Registry, Azure Artifacts, Nexus, Gemfury) to host internal Python packages. There is currently no way to configure dbt to authenticate against a private package index when resolving UDF packages.
Additionally, warehouses vary in how they accept custom/private code — Snowflake supports uploading .zip files via stages, BigQuery supports importing .py files from Cloud Storage — but neither supports authenticating against a private PyPI index directly. dbt could bridge this gap by resolving packages from a private index and preparing them in the format each warehouse requires.
Motivation
- Teams building internal Python libraries (e.g., shared feature engineering, custom ML models, proprietary transforms) cannot use them in dbt Python UDFs today.
- Warehouse-native package support is limited to public PyPI:
- Snowflake: Supports public PyPI via ARTIFACT_REPOSITORY + PACKAGES, but docs explicitly state "Access to private repositories is not supported." Custom code can be uploaded as .zip files to Snowflake stages and referenced via IMPORTS.
- BigQuery: Supports public PyPI via `OPTIONS(packages=["..."])` (wheels only). Custom Python files can be imported from GCS via `OPTIONS(library=["gs://..."])`, but this is limited to `.py` files, not full package archives.
- A dbt-native solution would provide a consistent, cross-adapter experience for private package consumption.
Proposed design
1. New package-indexes configuration (project-level)
Allow users to configure one or more private PyPI indexes in dbt_project.yml:
```yaml
package-indexes:
  - name: internal
    url: https://pypi.internal.company.com/simple/
    auth:
      type: token # or "basic", "env"
      token: "{{ env_var('PRIVATE_PYPI_TOKEN') }}"
  - name: aws-codeartifact
    url: https://my-domain-123456789.d.codeartifact.us-east-1.amazonaws.com/pypi/my-repo/simple/
    auth:
      type: basic
      username: "{{ env_var('CODEARTIFACT_USER') }}"
      password: "{{ env_var('CODEARTIFACT_TOKEN') }}"
```

`auth.type` supports common authentication patterns: `token` (bearer/API token), `basic` (username + password), and `env` (fully custom, e.g., for `AWS_SESSION_TOKEN`).
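As one illustration of how dbt-core might consume this config, the resolver could embed the rendered credentials into a pip-compatible index URL. The sketch below uses hypothetical names (`index_url_with_auth` is not existing dbt code); it assumes the common registry convention of accepting an API token as the basic-auth username:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def index_url_with_auth(url: str, auth: dict) -> str:
    """Embed credentials into an index URL the way pip expects
    (https://user:pass@host/simple/). Hypothetical helper."""
    scheme, netloc, path, query, frag = urlsplit(url)
    kind = auth.get("type", "env")
    if kind == "basic":
        cred = f"{quote(auth['username'], safe='')}:{quote(auth['password'], safe='')}"
    elif kind == "token":
        # Many registries accept a bearer/API token as the basic-auth username.
        cred = f"{quote(auth['token'], safe='')}:"
    else:
        # "env": leave the URL untouched; credentials come from the
        # environment (netrc, keyring, AWS_SESSION_TOKEN, ...).
        return url
    return urlunsplit((scheme, f"{cred}@{netloc}", path, query, frag))
```

A real implementation would render the `env_var()` Jinja references before this step and never log the resulting URL.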
2. Extended packages config on UDFs/models
Allow package specs to reference a named index:
```yaml
functions:
  - name: my_function
    config:
      packages:
        - scikit-learn # resolved from public PyPI (default)
        - name: my-internal-lib
          version: ">=1.2.0"
          index: internal # references the named index above
        - name: another-lib
          index: aws-codeartifact
```

The simple string form (`- scikit-learn`) remains supported for backward compatibility with public PyPI.
3. Artifact staging configuration (adapter-specific)
Since warehouses require pre-built artifacts to be uploaded to warehouse-specific storage, users need to configure where dbt stages these artifacts and how to authenticate. This would live in profiles.yml alongside existing adapter connection config:
Snowflake:

```yaml
my_snowflake_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      # ... existing connection config ...
      python_package_staging:
        stage: my_db.my_schema.python_packages # Snowflake stage for .zip uploads
```

BigQuery:

```yaml
my_bq_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      # ... existing connection config ...
      python_package_staging:
        gcs_bucket: my-project-dbt-packages # GCS bucket for .py/.whl files
        gcs_prefix: udf-deps/ # optional prefix/folder
        credentials: GCS_SA_KEY_PATH # if different from the BQ connection creds
```

Key design decisions:
- Dependency staging config lives in `profiles.yml`, not `dbt_project.yml`, because it contains environment-specific settings and potential credentials. This is consistent with how other connection/infra config is handled.
- Each adapter defines its own staging schema, since the storage mechanism is fundamentally different per warehouse.
- Credentials for artifact storage may differ from the main warehouse connection, e.g., a service account with GCS write access that is separate from the BigQuery query credentials. Adapters should support an optional credential override, falling back to the main connection credentials by default.
End-to-end flow
For a private package on a warehouse that doesn't support private PyPI natively:
- Resolve: dbt authenticates against the private index, resolves the package + transitive dependencies.
- Download: dbt downloads the resolved artifacts locally.
- Transform: dbt converts/repackages artifacts into the format the warehouse requires (.zip for Snowflake, .py for BigQuery, etc.).
- Upload: dbt uploads to the configured staging location (Snowflake stage, GCS bucket, DBFS path).
- Reference: The adapter generates the CREATE FUNCTION statement referencing the staged artifacts (IMPORTS for Snowflake, library for BigQuery, etc.).
Extras
- Since the index credentials are stored in `profiles.yml`, `dbt debug` should validate the connection to the private repository.
- We might also need to verify that the configured dependencies actually exist in the private repository, as a pre-validation mechanism.
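A `dbt debug`-style connectivity check could be sketched as below. Both helpers are hypothetical; a real implementation would reuse dbt's credential rendering and report richer diagnostics than a boolean:

```python
import base64
import urllib.error
import urllib.request

def auth_headers(auth: dict) -> dict:
    """Translate the proposed auth config into HTTP headers (sketch)."""
    if auth.get("type") == "token":
        return {"Authorization": f"Bearer {auth['token']}"}
    if auth.get("type") == "basic":
        raw = f"{auth['username']}:{auth['password']}".encode()
        return {"Authorization": "Basic " + base64.b64encode(raw).decode()}
    return {}  # "env": rely on ambient credentials (netrc, keyring, ...)

def check_index(url: str, auth: dict, timeout: float = 5.0) -> bool:
    """Treat any non-error HTTP response from the index root as reachable."""
    req = urllib.request.Request(url, headers=auth_headers(auth))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except urllib.error.URLError:
        return False
```

Existence pre-validation could reuse the same machinery against the index's per-package path before any model runs.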
Acceptance criteria
- Users can configure one or more private PyPI indexes in `dbt_project.yml` with credential references
- The `packages` config on functions/models supports specifying which index to resolve from
- Users can configure warehouse-specific artifact staging locations in `profiles.yml`
- Adapters can authenticate against staging storage (with optional credential override)
- Credentials are never stored in plain text — only env var references
- Public PyPI remains the default when no index is specified (backward compatible)
- Documentation covers setup for common private registries (Artifactory, CodeArtifact, GCP Artifact Registry) and staging configuration per warehouse
Implementation considerations
- dbt-core scope: Package index configuration, credential management, resolution/download logic, the extended `packages` schema, and a standard interface for adapters to receive resolved artifacts.
- dbt-adapters scope: Warehouse-specific artifact transformation, upload to staging storage, and `CREATE FUNCTION` statement generation. See dbt-adapters#1651 for the adapter-side discussion.
- Dependency resolution: When downloading for artifact-based delivery, dbt would need to resolve transitive dependencies. We should consider leveraging `pip download` or a library like `resolvelib` rather than reimplementing resolution.
- Caching: Downloaded and transformed artifacts should be cached locally to avoid re-downloading on every run. We could consider a content-addressed cache keyed on package name + version + platform.
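The cache-key idea above could be sketched as follows (`artifact_cache_key` is a hypothetical name, not existing dbt code):

```python
import hashlib

def artifact_cache_key(name: str, version: str, platform_tag: str) -> str:
    """Deterministic cache key for a resolved/transformed artifact,
    keyed on package name + version + platform as suggested above."""
    digest = hashlib.sha256(f"{name}=={version}#{platform_tag}".encode()).hexdigest()
    # Human-readable prefix plus a short content hash for the cache path.
    return f"{name}-{version}-{digest[:12]}"
```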
Describe alternatives you've considered
I could do all of this manually, and it would work. However, dbt providing a standard interface for this across adapters would be a great ease-of-use feature.
Who will this benefit?
Anyone who needs to use private PyPI packages in their Python UDFs.
Are you interested in contributing this feature?
YES
Anything else?
No response