S3 deduplication proxy server with Filetracker protocol compatibility.
s3dedup is an S3 proxy layer that adds content-based deduplication capabilities while maintaining backwards compatibility with the Filetracker protocol (v2). Files with identical content are stored only once in S3, reducing storage costs and improving efficiency.
- Content Deduplication: Files are stored by SHA256 hash, so identical content is stored only once
- Filetracker Compatible: Drop-in replacement for legacy Filetracker servers
- Pluggable Storage: Support for SQLite and PostgreSQL metadata storage
- Distributed Locking: PostgreSQL advisory locks for distributed, high-availability deployments
- Migration Support: Offline and live migration from old Filetracker instances
- Auto Cleanup: Background cleaner removes unreferenced S3 objects
- Single-instance per bucket: Each instance handles exactly one bucket; scale horizontally with multiple instances
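The deduplication idea above can be sketched in a few lines of shell: two files with identical bytes hash to the same SHA256 digest, so a content-addressed store keeps a single object for both (file names here are illustrative):

```shell
# Two uploads with identical content produce the same SHA256 digest,
# so a content-addressed store needs only one object for both.
printf 'same bytes\n' > upload-a.txt
printf 'same bytes\n' > upload-b.txt

hash_a=$(sha256sum upload-a.txt | cut -d' ' -f1)
hash_b=$(sha256sum upload-b.txt | cut -d' ' -f1)

if [ "$hash_a" = "$hash_b" ]; then
  echo "deduplicated: one object keyed by $hash_a"
fi
```

A reference count in the metadata store tracks how many logical paths point at each object, so deleting one path never removes content still referenced by another.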
Pull the image from GitHub Container Registry:
```shell
docker pull ghcr.io/sio2project/s3dedup:latest
```

Run with environment variables:
```shell
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  -e S3_ENDPOINT=http://minio:9000 \
  -e S3_ACCESS_KEY=minioadmin \
  -e S3_SECRET_KEY=minioadmin \
  ghcr.io/sio2project/s3dedup:latest
```

Or use an environment file:
```shell
# Copy and customize .env.example
cp .env.example .env

# Run with env file
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest
```

| Variable | Default | Description |
|---|---|---|
| LOG_LEVEL | info | Logging level (trace, debug, info, warn, error) |
| LOG_JSON | false | Enable JSON logging |
| BUCKET_NAME | default | Bucket name identifier |
| LISTEN_ADDRESS | 0.0.0.0 | Server bind address |
| LISTEN_PORT | 8080 | Server port |
| KVSTORAGE_TYPE | sqlite | KV storage backend (sqlite, postgres) |
| SQLITE_PATH | /app/data/kv.db | SQLite database path |
| SQLITE_MAX_CONNECTIONS | 10 | SQLite connection pool size |
| LOCKS_TYPE | memory | Lock manager backend (memory, postgres) |
| S3_ENDPOINT | required | S3/MinIO endpoint URL |
| S3_ACCESS_KEY | required | S3 access key |
| S3_SECRET_KEY | required | S3 secret key |
| S3_FORCE_PATH_STYLE | true | Use path-style S3 URLs |
| CLEANER_ENABLED | true | Enable background cleaner |
| CLEANER_INTERVAL | 3600 | Cleaner run interval (seconds) |
| CLEANER_BATCH_SIZE | 1000 | Cleaner batch size |
| CLEANER_MAX_DELETES | 10000 | Max deletions per cleaner run |
| FILETRACKER_URL | - | Old Filetracker URL for live migration (HTTP fallback) |
| FILETRACKER_V1_DIR | - | V1 Filetracker directory for filesystem-based migration |
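Putting the required variables together, a minimal `.env` for a local MinIO setup might look like this (endpoint and credentials are placeholders for your environment; the remaining variables fall back to the defaults above):

```shell
S3_ENDPOINT=http://minio:9000
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
BUCKET_NAME=default
LOG_LEVEL=info
```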
For PostgreSQL KV storage, use:
```shell
KVSTORAGE_TYPE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password
POSTGRES_DB=s3dedup
POSTGRES_MAX_CONNECTIONS=10
```
For distributed locking across multiple instances in high-availability setups, enable PostgreSQL-based advisory locks:
```shell
LOCKS_TYPE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password
POSTGRES_DB=s3dedup
POSTGRES_MAX_CONNECTIONS=10
```
When to Use:
- Single-instance deployments: Use default memory-based locking (LOCKS_TYPE=memory)
- Multi-instance HA deployments: Use PostgreSQL-based locking for coordinated access
Note: PostgreSQL locks share the connection pool with KV storage. Ensure sufficient pool size for concurrent operations. See DEVELOPMENT.md for implementation details.
The POSTGRES_MAX_CONNECTIONS setting controls the maximum number of concurrent database connections. This pool is shared between KV storage operations and lock management.
Quick Start Recommendations:
- Development: POSTGRES_MAX_CONNECTIONS=10
- Small Production (1-3 instances): POSTGRES_MAX_CONNECTIONS=20-30
- Large Production (5+ instances): POSTGRES_MAX_CONNECTIONS=50-100
For detailed pool sizing guidance, monitoring strategies, and tuning considerations, see DEVELOPMENT.md.
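As a rough back-of-the-envelope check (the formula and numbers below are illustrative, not taken from DEVELOPMENT.md): if each in-flight request can hold one KV connection and one lock connection at the same time, the pool should cover roughly twice the expected concurrency, plus some headroom for the cleaner:

```shell
# Illustrative sizing only: assume 2 connections per in-flight request
# (KV + lock) plus a little headroom for the background cleaner.
INSTANCES=3
REQUESTS_PER_INSTANCE=8

PER_INSTANCE=$((REQUESTS_PER_INSTANCE * 2 + 2))
TOTAL=$((PER_INSTANCE * INSTANCES))

echo "POSTGRES_MAX_CONNECTIONS=$PER_INSTANCE"                    # per instance
echo "PostgreSQL server max_connections should exceed $TOTAL"    # whole fleet
```

Remember that the PostgreSQL server's own `max_connections` must accommodate the sum across all instances, not just one pool.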
Alternatively, use a JSON config file:
```shell
docker run -d \
  -p 8080:8080 \
  -v $(pwd)/config.json:/app/config.json \
  -v s3dedup-data:/app/data \
  ghcr.io/sio2project/s3dedup:latest \
  server --config /app/config.json
```

Environment variables override config file values.
s3dedup follows a single-bucket-per-instance design pattern, consistent with 12-factor application principles:
- One Instance = One Bucket: Each s3dedup instance manages exactly one S3 bucket and serves one Filetracker endpoint
- Horizontal Scaling: For multiple buckets, run multiple s3dedup instances (one per bucket)
- Simplified Configuration: Cleaner config files, easier to reason about, better for container orchestration
For a single bucket with high availability, run multiple instances with PostgreSQL locks and shared database:
```shell
# All instances share the same PostgreSQL database and use PostgreSQL locks
docker run -d \
  --name s3dedup-ha-1 \
  -p 8001:8080 \
  -e BUCKET_NAME=files \
  -e LISTEN_PORT=8080 \
  -e KVSTORAGE_TYPE=postgres \
  -e LOCKS_TYPE=postgres \
  -e POSTGRES_HOST=postgres-db \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=s3dedup \
  -e S3_ENDPOINT=http://minio:9000 \
  -e S3_ACCESS_KEY=minioadmin \
  -e S3_SECRET_KEY=minioadmin \
  ghcr.io/sio2project/s3dedup:latest server --env

# Repeat for instances 2, 3, etc., on different ports
```

Benefits of HA Setup:
- Load Balancing: Requests can be distributed across multiple instances
- Fault Tolerance: If one instance fails, others continue serving requests
- Coordinated Access: PostgreSQL locks ensure safe concurrent file operations
- Shared Metadata: Single database prevents data inconsistency
📖 Complete Migration Guide: See docs/migration.md for comprehensive migration instructions
s3dedup supports migration from both Filetracker V1 (filesystem-based) and V2 (HTTP-based) servers.
Migrate all files from Filetracker V2 via HTTP while the proxy is offline:
```shell
docker run --rm \
  --env-file .env \
  -v s3dedup-data:/app/data \
  ghcr.io/sio2project/s3dedup:latest \
  migrate --env \
  --filetracker-url http://old-filetracker:8000 \
  --max-concurrency 10
```

Run the proxy while migrating in the background:
```shell
# Set FILETRACKER_URL in your .env file
echo "FILETRACKER_URL=http://old-filetracker:8000" >> .env

# Start in live migration mode
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest \
  live-migrate --env --max-concurrency 10
```

During V2 live migration:
- GET: Falls back to old Filetracker if file not found, migrates on-the-fly
- PUT: Writes to both s3dedup and old Filetracker
- DELETE: Deletes from both systems
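The GET fallback above can be modeled with plain files (the directories and `fetch` helper are illustrative; the real proxy talks to S3 and the old server over HTTP):

```shell
# Toy model of live-migration GET: serve from the new store if present,
# otherwise fetch from the old store and migrate the file on the fly.
mkdir -p new-store old-store
printf 'legacy data\n' > old-store/report.txt

fetch() {
  if [ -f "new-store/$1" ]; then
    cat "new-store/$1"
  elif [ -f "old-store/$1" ]; then
    cp "old-store/$1" "new-store/$1"   # migrate on first access
    cat "new-store/$1"
  else
    return 1                           # 404 in the real proxy
  fi
}

fetch report.txt                       # served from the old store, then copied
[ -f new-store/report.txt ] && echo "migrated"
```

After the first access the file lives in the new store, so subsequent GETs never touch the old server.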
V1 Filetracker stores files directly on the filesystem and serves them via a simple HTTP protocol.
The key difference from V2 is that V1 doesn't have a /list/ endpoint for file discovery, so migration uses
filesystem walking.
Performance: V1 migration uses chunked processing to handle millions of files efficiently without loading all file paths into memory. The filesystem is scanned in chunks of 10,000 files, keeping memory usage constant regardless of total file count.
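The chunked scan can be illustrated with `find` and `xargs` (chunk size shrunk from 10,000 to 3 so the batching is visible; file names are made up):

```shell
# Stream paths from find and hand them to a worker in fixed-size
# batches, so memory use stays flat regardless of total file count.
CHUNK=3   # s3dedup uses 10,000; tiny here for the demo
mkdir -p v1-root
for i in 1 2 3 4 5 6 7; do printf 'data\n' > "v1-root/file$i"; done

# 7 files in batches of 3 -> prints: batch of 3, batch of 3, batch of 1
find v1-root -type f | xargs -n "$CHUNK" sh -c 'echo "batch of $#"' _
```

Because paths are streamed rather than collected up front, peak memory is bounded by the chunk size, not the directory size.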
Migrate from V1 filesystem (requires access to $FILETRACKER_DIR):
```shell
docker run --rm \
  --env-file .env \
  -v s3dedup-data:/app/data \
  -v /path/to/filetracker:/filetracker:ro \
  ghcr.io/sio2project/s3dedup:latest \
  migrate-v1 --env \
  --v1-directory /filetracker \
  --max-concurrency 10
```

Run the proxy while migrating from V1 in the background:
```shell
# With both filesystem access and HTTP fallback
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  -v /path/to/filetracker:/filetracker:ro \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest \
  live-migrate-v1 --env \
  --v1-directory /filetracker \
  --filetracker-url http://old-filetracker-v1:8000 \
  --max-concurrency 10

# Or with HTTP fallback only (no filesystem access)
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest \
  live-migrate-v1 --env \
  --filetracker-url http://old-filetracker-v1:8000 \
  --max-concurrency 10
```

During V1 live migration:
- Background filesystem migration: If `--v1-directory` is provided, the filesystem is scanned in chunks to migrate all files
  - Chunked processing handles millions of files with constant memory usage
- HTTP fallback: If `--filetracker-url` is provided, GET requests fall back to the V1 server when a file is not found
  - Automatically migrates files on first access
- New requests: Server accepts PUT/GET/DELETE requests normally during migration
For detailed migration strategies, performance tuning, troubleshooting, and rollback procedures, see the Migration Guide.
Compatible with Filetracker protocol v2:
- `GET /ft/version` - Get protocol version
- `GET /ft/list/{path}` - List files
- `GET /ft/files/{path}` - Download file
- `HEAD /ft/files/{path}` - Get file metadata
- `PUT /ft/files/{path}` - Upload file
- `DELETE /ft/files/{path}` - Delete file
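As a usage sketch (host, port, and paths are placeholders; headers required by the Filetracker protocol, if any, are omitted — check the protocol docs before scripting against a real instance):

```shell
# Upload, inspect, download, and delete a file through the proxy.
BASE=http://localhost:8080

curl -X PUT --data-binary @report.pdf "$BASE/ft/files/reports/report.pdf"
curl -I "$BASE/ft/files/reports/report.pdf"            # metadata only (HEAD)
curl -o report.pdf "$BASE/ft/files/reports/report.pdf" # download
curl -X DELETE "$BASE/ft/files/reports/report.pdf"
```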
For comprehensive testing guide, see DEVELOPMENT.md.
Quick start:
```shell
# Run unit tests (no external dependencies)
cargo test --lib

# Run all tests (requires PostgreSQL + MinIO)
docker-compose up -d
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/s3dedup_test"
cargo test
docker-compose down
```

See DEVELOPMENT.md for detailed development instructions including:
- Building from source
- Running tests with different configurations
- PostgreSQL advisory lock implementation details
- Contributing guidelines
- Performance considerations
Quick start:
```shell
# Run with Docker Compose (includes PostgreSQL + MinIO)
docker-compose up

# In another terminal, run tests
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/s3dedup_test"
cargo test
```

- API Layer: Axum-based HTTP server with Filetracker routes
- Deduplication: SHA256-based content addressing
- Storage Backend: S3-compatible object storage (MinIO, AWS S3, etc.)
- Metadata Store: SQLite or PostgreSQL for file metadata and reference counts
- Lock Manager: In-memory (single-instance) or PostgreSQL advisory locks (distributed, multi-instance HA)
- Memory locks: Fast, suitable for single-instance deployments
- PostgreSQL locks: Distributed coordination, suitable for multi-instance HA setups
- Cleaner: Background worker that removes unreferenced S3 objects
For detailed architecture documentation, see:
- docs/deduplication.md - Deduplication architecture and performance
- DEVELOPMENT.md - Lock implementation details and code architecture
- Development Guide - Building, testing, lock implementation details, and contributing
- Migration Guide - Migrating from Filetracker v2.1+ (offline and live migration strategies)
- Deduplication Architecture - How content-based deduplication works, data flows, and performance characteristics
See LICENSE file for details.