@pyramation pyramation commented Jan 11, 2026

Summary

Adds the pg_lake extension from Snowflake Labs via a new, separate Docker image. With this PR the repository builds two images:

  1. postgres-plus (Alpine) - The existing lean image with pgvector, PostGIS, pg_textsearch, pgsodium (unchanged)
  2. postgres-plus-lake (Debian) - New image with all the above extensions PLUS pg_lake for Iceberg and data lake access

The pg_lake image uses Debian and builds PostgreSQL 17.7 from source because pg_lake requires PostgreSQL internal headers (like server/rewrite/rewriteManip.h) that aren't available in pre-built packages.
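
One way to check this against a given installation is to look for the header under the server include directory. The following is only a minimal sketch: `has_server_header` is a hypothetical helper name, not part of this PR, and the commented usage assumes `pg_config` is on PATH.

```sh
# has_server_header DIR HEADER
# Succeeds when HEADER (a path relative to DIR) exists as a regular file.
# DIR is expected to be the output of `pg_config --includedir-server`.
has_server_header() {
    [ -f "$1/$2" ]
}

# Example use against a real installation (requires pg_config on PATH):
#   if has_server_header "$(pg_config --includedir-server)" rewrite/rewriteManip.h; then
#       echo "internal headers available"
#   fi
```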

Changes

  • Added Dockerfile.pg_lake - Debian-based multi-stage build with PostgreSQL compiled from source
  • Updated CI workflow to build both images on amd64 and arm64
  • Updated Makefile with build-lake, test-lake, run-lake targets
  • Updated README to document both images

Note: DuckDB/pgduck_server integration is not included due to build complexity (requires vcpkg, Azure SDK, etc.). The core pg_lake extensions still provide Iceberg table support and data lake file access.

Review & Testing Checklist for Human

  • Test postgres-plus-lake image locally (make build-lake && make test-lake) - CI passes but I did not test locally
  • Test pg_lake functionality - try CREATE EXTENSION pg_lake; and basic operations (Iceberg table, Parquet file query)
  • Verify postgres-plus image unchanged - confirm the Alpine image still works as expected
  • Consider pinning pg_lake version - currently defaults to main branch; may want to pin to a specific release tag

Recommended Test Plan

# Test the pg_lake image
make build-lake
make test-lake

# Manual verification
docker run -d --name pg-lake-test -e POSTGRES_PASSWORD=test -p 5432:5432 constructive/postgres-lake:latest
docker exec -it pg-lake-test psql -U postgres -c "CREATE EXTENSION pg_lake; SELECT * FROM pg_available_extensions WHERE name LIKE 'pg_%';"
docker stop pg-lake-test && docker rm pg-lake-test
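
The manual verification runs psql right after docker run, so the first command may race against server startup. A small retry wrapper is one way to wait for readiness. This is a sketch: `retry` is a hypothetical helper, and the commented usage assumes `pg_isready` is on PATH inside the image.

```sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example (assumes the container from the test plan is running):
#   retry 30 1 docker exec pg-lake-test pg_isready -U postgres
```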

Notes

  • pg_lake requires PostgreSQL built from source for internal headers - this is why the image uses Debian instead of Alpine
  • The Debian image is larger than Alpine but necessary for pg_lake compatibility
  • Runtime dependencies (snappy, jansson, lz4, xz, zstd, krb5, curl) are included for pg_lake
  • The avro library is copied separately to /usr/local/lib/ since it's not a PostgreSQL extension

Link to Devin run: https://app.devin.ai/sessions/1dbdc63238494ca8845442ba53c957b5
Requested by: Dan Lynch (@pyramation)

- Add pg_lake extension from Snowflake Labs to the Docker image
- Build pg_lake with all core extensions (pg_map, pg_extension_base,
  pg_extension_updater, pg_lake_engine, pg_lake_copy, pg_lake_iceberg,
  pg_lake_table, pg_lake)
- Build and include Apache Avro library required by pg_lake
- Add runtime dependencies for pg_lake (snappy, jansson, lz4, xz, zstd, libpq)
- Update README to document pg_lake extension
- Update Makefile test to verify pg_lake extension loads correctly

Note: DuckDB/pgduck_server integration is not included due to Alpine Linux
compatibility constraints. The core pg_lake extensions provide Iceberg table
support and data lake file access capabilities.

The pg_lake_engine extension requires gssapi/gssapi.h from the PostgreSQL
server headers, which depends on the Kerberos development package.

- Revert main Dockerfile to original Alpine-based image (postgres-plus)
- Add new Dockerfile.pg_lake with Debian base for pg_lake compatibility
- Update CI workflow to build both images (postgres-plus and postgres-plus-lake)
- Update Makefile with targets for both images (build-lake, test-lake, etc.)
- Update README to document both images

The pg_lake extension requires PostgreSQL server internal headers that aren't
available in Alpine-based images, so it uses a Debian (bookworm) base instead.

The parallel make was causing race conditions with the raster module
even when configured with --without-raster.

CMake couldn't find libjansson because pkg-config was missing.

pg_lake requires PostgreSQL internal headers (like server/rewrite/rewriteManip.h)
that are not available in pre-built packages. This follows pg_lake's official
Dockerfile approach of building PostgreSQL from source.

Changes:
- Use debian:bookworm-slim as base instead of postgres:17-bookworm
- Build PostgreSQL 17.2 from source with required configure flags
- Include simple entrypoint script for database initialization
- Add all required runtime dependencies
devin-ai-integration bot changed the title from "Add pg_lake extension for Iceberg and data lake access" to "Add pg_lake extension via separate Debian-based image" on Jan 11, 2026
Jovonni requested a review from Copilot on January 11, 2026 14:57
Copilot AI left a comment

Pull request overview

This PR adds support for the pg_lake extension (for Iceberg and data lake access) via a new separate Debian-based Docker image while maintaining the existing lean Alpine-based image unchanged.

Changes:

  • Added a new postgres-plus-lake image that builds PostgreSQL 17.7 from source (required for pg_lake internal headers)
  • Extended CI workflow to build and publish both Alpine and Debian images for amd64 and arm64 platforms
  • Updated Makefile with parallel targets for building and testing both images

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| Dockerfile.pg_lake | New multi-stage Debian-based Dockerfile that builds PostgreSQL from source and includes all extensions plus pg_lake |
| .github/workflows/docker.yml | Extended to build both postgres-plus and postgres-plus-lake images with proper digest handling and manifest creation |
| Makefile | Added lake-specific targets (build-lake, test-lake, run-lake) alongside existing targets |
| README.md | Updated documentation to describe both images with separate usage examples |

if [ -z "$(ls -A "$PGDATA" 2>/dev/null)" ]; then
gosu postgres initdb --username=postgres --pwfile=<(echo "${POSTGRES_PASSWORD:-postgres}")

# Allow connections from anywhere
Copilot AI Jan 11, 2026

The pg_hba.conf configuration allows MD5 password authentication from all hosts (0.0.0.0/0). While this is typical for development containers, it poses a security risk if accidentally used in production. Consider adding a comment warning that this configuration is for development/testing only, or use 'scram-sha-256' instead of 'md5' for better password security.

Suggested change
- # Allow connections from anywhere
+ # Allow connections from anywhere (development/testing only; not recommended for production)
+ echo "# WARNING: The following pg_hba.conf entry is intended for development/testing only and is not safe for production use." >> "$PGDATA/pg_hba.conf"
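
For the 'scram-sha-256' alternative mentioned above, the rule itself could be generated by a small helper so the method is chosen in one place. This is a sketch only: `hba_rule` is a hypothetical name, and the exact pg_hba.conf line written by the entrypoint in this PR is not shown here, so the format below is an assumption. PostgreSQL 14 and later default `password_encryption` to scram-sha-256, so a password set at initdb time should already be stored in a form this method accepts.

```sh
# hba_rule METHOD
# Prints one pg_hba.conf line allowing password logins from any IPv4 address.
# Only the two password methods discussed in the review are accepted.
hba_rule() {
    case "$1" in
        scram-sha-256|md5) ;;
        *) echo "unsupported auth method: $1" >&2; return 1 ;;
    esac
    printf 'host all all 0.0.0.0/0 %s\n' "$1"
}

# Example use in an entrypoint (PGDATA must already be initialized):
#   hba_rule scram-sha-256 >> "$PGDATA/pg_hba.conf"
```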

Copilot uses AI. Check for mistakes.
Comment on lines +72 to +78
./configure --prefix=${PGBASEDIR} \
--with-openssl \
--with-libxml \
--with-libxslt \
--with-icu \
--with-uuid=ossp \
--with-lz4 && \
Copilot AI Jan 11, 2026

The PostgreSQL source build does not pass '--enable-debug' to configure. The avro library is built as RelWithDebInfo (line 123), but PostgreSQL itself is built without an explicit choice of build type or debug symbols. That is fine for production images; consider adding a comment that documents why debug symbols are intentionally excluded, or that describes the chosen build configuration.
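
To confirm which options a built server was actually configured with, the output of `pg_config --configure` can be inspected. A minimal sketch follows; `has_configure_flag` is a hypothetical helper, and the sample line in the comment is illustrative rather than taken from this image.

```sh
# has_configure_flag CONFIGURE_LINE FLAG
# CONFIGURE_LINE is the output of `pg_config --configure`. Options may be
# wrapped in single quotes there, so quotes are stripped before matching.
# Matches both bare flags (--with-lz4) and valued flags (--with-uuid=ossp).
has_configure_flag() {
    flags=$(printf '%s' "$1" | tr -d "'")
    case " $flags " in
        *" $2 "*|*" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (requires pg_config on PATH):
#   has_configure_flag "$(pg_config --configure)" --enable-debug \
#       && echo "built with debug symbols"
```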

fi
digests=""
for digest_file in "$digest_dir"/*; do
Copilot AI Jan 11, 2026

The for loop can misbehave when the digest directory is empty: with no matches, the pattern "$digest_dir"/* is left unexpanded and the loop body runs once with that literal string. By default the glob does not match hidden files, but it does match any non-digest file that happens to be present. Consider adding [ ! -e "$digest_file" ] && continue as the first statement of the loop body, or use a more specific pattern that validates digest filenames.

Suggested change
- for digest_file in "$digest_dir"/*; do
+ for digest_file in "$digest_dir"/*; do
+   [ ! -e "$digest_file" ] && continue
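
The guard can be exercised in isolation. The sketch below assumes each file in the digest directory is named after the hex part of an image digest, which is a common pattern but is not confirmed by the workflow excerpt here; `digest_refs` and the image name are hypothetical.

```sh
# digest_refs IMAGE DIGEST_DIR
# Prints one "IMAGE@sha256:<filename>" reference per regular file in DIGEST_DIR.
digest_refs() {
    image=$1
    dir=$2
    for digest_file in "$dir"/*; do
        # With no matches the glob stays literal; -f skips that literal
        # string and also skips subdirectories.
        [ -f "$digest_file" ] || continue
        printf '%s@sha256:%s\n' "$image" "${digest_file##*/}"
    done
}

# Example:
#   digest_refs constructive/postgres-plus-lake /tmp/digests
```

An empty directory produces no output instead of a bogus reference built from the unexpanded pattern.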

ARG POSTGIS_VERSION=3.5.1
ARG PG_TEXTSEARCH_VERSION=0.2.0
ARG PGSODIUM_VERSION=3.1.9
ARG PG_LAKE_VERSION=main
Copilot AI Jan 11, 2026

Using 'main' branch as the default version for pg_lake is risky for production use. The 'main' branch can contain unstable or breaking changes. Consider pinning to a specific release tag or commit SHA to ensure reproducible builds and prevent unexpected breakage.

Suggested change
- ARG PG_LAKE_VERSION=main
+ ARG PG_LAKE_VERSION=v0.1.0
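
A build script could refuse floating refs before starting a long build. This is a sketch under assumptions: `is_pinned_ref` is a hypothetical helper, it treats only vX.Y.Z tags and full 40-character commit SHAs as pinned, and whether a given tag exists upstream is not checked.

```sh
# is_pinned_ref REF
# Succeeds for a version tag such as v1.2.3 or a full 40-character
# lowercase hex commit SHA; fails for branch names like "main".
is_pinned_ref() {
    printf '%s\n' "$1" | grep -Eq '^(v[0-9]+\.[0-9]+\.[0-9]+|[0-9a-f]{40})$'
}

# Example:
#   is_pinned_ref "$PG_LAKE_VERSION" || echo "warning: building from a floating ref" >&2
```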

RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 1777 "$PGDATA"

# Create run directory for socket
RUN mkdir -p /var/run/postgresql && chown -R postgres:postgres /var/run/postgresql && chmod 3777 /var/run/postgresql
Copilot AI Jan 11, 2026

The chmod value '3777' sets both sticky bit and setgid bit on the socket directory. While PostgreSQL runtime directories often use '2775' or '2777', the '3777' (sticky + setgid + full permissions) is overly permissive. Consider using '2775' (setgid + rwxrwxr-x) which is more typical for PostgreSQL socket directories and provides adequate functionality with better security.

Suggested change
- RUN mkdir -p /var/run/postgresql && chown -R postgres:postgres /var/run/postgresql && chmod 3777 /var/run/postgresql
+ RUN mkdir -p /var/run/postgresql && chown -R postgres:postgres /var/run/postgresql && chmod 2775 /var/run/postgresql
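
The leading digit of a four-digit mode is a bit mask (4 = setuid, 2 = setgid, 1 = sticky), which is why 3777 means setgid plus sticky. A small sketch that decodes it, using a hypothetical helper name `special_bits`:

```sh
# special_bits MODE
# Prints which special permission bits are set in an octal MODE string.
special_bits() {
    mode=$((0$1))   # leading 0 makes the shell read the value as octal
    out=""
    if [ $((mode & 04000)) -ne 0 ]; then out="$out setuid"; fi
    if [ $((mode & 02000)) -ne 0 ]; then out="$out setgid"; fi
    if [ $((mode & 01000)) -ne 0 ]; then out="$out sticky"; fi
    echo "${out# }"
}

# Example:
#   special_bits 3777
#   special_bits 2775
```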

Comment on lines 3 to +4
IMAGE_NAME ?= constructive/postgres
IMAGE_NAME_LAKE ?= constructive/postgres-lake
Copilot AI Jan 11, 2026

There's a naming inconsistency between the Makefile and the CI workflow. The Makefile uses image names 'constructive/postgres' and 'constructive/postgres-lake', but the CI workflow and README reference 'postgres-plus' and 'postgres-plus-lake'. This mismatch will cause confusion when developers try to use local Makefile commands versus pulling from the registry. Consider aligning the Makefile IMAGE_NAME to match the published names (e.g., 'constructive/postgres-plus').

Suggested change
- IMAGE_NAME ?= constructive/postgres
- IMAGE_NAME_LAKE ?= constructive/postgres-lake
+ IMAGE_NAME ?= constructive/postgres-plus
+ IMAGE_NAME_LAKE ?= constructive/postgres-plus-lake

RUN ldconfig

# Create data directory
RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 1777 "$PGDATA"
Copilot AI Jan 11, 2026

The chmod value '1777' sets the sticky bit on the PGDATA directory. PostgreSQL expects the data directory to be '0700' (or '0750' when group access is enabled), and the server refuses to start if it finds looser permissions. The sticky bit (1000) is meant for shared directories like /tmp, not for database data directories, so this mode could lead to permission issues or security concerns.

Suggested change
- RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 1777 "$PGDATA"
+ RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 0700 "$PGDATA"
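
An entrypoint or test target could verify the mode before starting the server. The sketch below assumes GNU coreutils `stat` (as on Debian), and `pgdata_mode_ok` is a hypothetical helper name.

```sh
# pgdata_mode_ok DIR
# Succeeds when DIR has mode 700 or 750, the modes the server accepts
# for its data directory. Relies on GNU stat's -c '%a' (octal mode).
pgdata_mode_ok() {
    mode=$(stat -c '%a' "$1") || return 1
    case "$mode" in
        700|750) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   pgdata_mode_ok "$PGDATA" || chmod 0700 "$PGDATA"
```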


# If PGDATA is empty, initialize the database
if [ -z "$(ls -A "$PGDATA" 2>/dev/null)" ]; then
gosu postgres initdb --username=postgres --pwfile=<(echo "${POSTGRES_PASSWORD:-postgres}")
Copilot AI Jan 11, 2026

The entrypoint script uses process substitution with gosu which may not work reliably in all shell environments. The pwfile option with process substitution --pwfile=<(echo ...) can fail in dash or other minimal shells. Consider writing the password to a temporary file and cleaning it up, or using stdin with echo ... | gosu postgres initdb ... --pwfile=/dev/stdin for better compatibility.

Suggested change
- gosu postgres initdb --username=postgres --pwfile=<(echo "${POSTGRES_PASSWORD:-postgres}")
+ echo "${POSTGRES_PASSWORD:-postgres}" | gosu postgres initdb --username=postgres --pwfile=/dev/stdin
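
The stdin variant can be tried without initdb. In the sketch below, `first_line` is a hypothetical stand-in for a program that, like initdb with --pwfile, takes the first line of the named file as the password. It uses only POSIX shell features, so it also runs under dash.

```sh
# first_line FILE
# Stand-in for a program that reads a password from the first line of FILE.
first_line() {
    IFS= read -r line < "$1"
    printf '%s\n' "$line"
}

# POSIX-compatible: feed the value through a pipe and name /dev/stdin,
# instead of the bash-only <(...) process substitution.
#   printf '%s\n' "${POSTGRES_PASSWORD:-postgres}" | first_line /dev/stdin
```

Using printf rather than echo avoids shell-dependent handling of backslashes in the password.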

Comment on lines +163 to +166
# Note: pg_lake image uses Dockerfile default PG_VERSION (full version like 17.7)
# because it builds PostgreSQL from source and needs exact version number
cache-from: type=gha,scope=postgres-plus-lake-${{ matrix.arch }}
cache-to: type=gha,mode=max,scope=postgres-plus-lake-${{ matrix.arch }}
Copilot AI Jan 11, 2026

The comment mentions that pg_lake image uses "Dockerfile default PG_VERSION (full version like 17.7)" but there's a mismatch: the workflow sets PG_VERSION='17' (major version), while Dockerfile.pg_lake defaults to PG_VERSION=17.7 (full version). Since the build doesn't pass PG_VERSION as a build-arg, the Dockerfile will always use 17.7 regardless of the workflow env var. The comments should be clarified to explain that the pg_lake build intentionally uses its hardcoded version (17.7) because it needs to download PostgreSQL source, while the Alpine image uses the workflow's major version (17) to reference pre-built postgres:17-alpine base images.
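
The relationship between the two version forms can be made explicit with two small helpers. This is a sketch: `pg_major` and `pg_source_url` are hypothetical names, and the URL follows the layout of the public PostgreSQL source archive, which may or may not be the exact location Dockerfile.pg_lake downloads from.

```sh
# pg_major FULL_VERSION
# Prints the major version: "17.7" becomes "17".
pg_major() {
    printf '%s\n' "${1%%.*}"
}

# pg_source_url FULL_VERSION
# Prints the source tarball URL for an exact release such as 17.7.
pg_source_url() {
    printf 'https://ftp.postgresql.org/pub/source/v%s/postgresql-%s.tar.bz2\n' "$1" "$1"
}

# Example:
#   pg_major 17.7        # usable as a base-image tag prefix, e.g. postgres:17-alpine
#   pg_source_url 17.7   # needs the full version to locate the tarball
```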

Comment on lines +36 to +42
| Extension | Description |
|-----------|-------------|
| [pgvector](https://github.com/pgvector/pgvector) | Vector similarity search for embeddings |
| [PostGIS](https://postgis.net/) | Spatial and geographic data |
| [pg_textsearch](https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch) | BM25 full-text search |
| [pgsodium](https://github.com/michelp/pgsodium) | Encryption using libsodium |
| [pg_lake](https://github.com/Snowflake-Labs/pg_lake) | Iceberg and data lake access |
Copilot AI Jan 11, 2026

The extension table for postgres-plus-lake duplicates all entries from postgres-plus. This creates maintenance burden as any changes to the core extensions would need to be updated in two places. Consider restructuring the documentation to list the core extensions once, then clearly indicate that postgres-plus-lake includes all of those plus pg_lake.
