s3dedup

S3 deduplication proxy server with Filetracker protocol compatibility.

Overview

s3dedup is an S3 proxy layer that adds content-based deduplication capabilities while maintaining backwards compatibility with the Filetracker protocol (v2). Files with identical content are stored only once in S3, reducing storage costs and improving efficiency.

Features

  • Content Deduplication: Files are stored by their SHA256 hash; identical content is stored only once
  • Filetracker Compatible: Drop-in replacement for legacy Filetracker servers
  • Pluggable Storage: Support for SQLite and PostgreSQL metadata storage
  • Distributed Locking: PostgreSQL advisory locks for distributed, high-availability deployments
  • Migration Support: Offline and live migration from old Filetracker instances
  • Auto Cleanup: Background cleaner removes unreferenced S3 objects
  • Single-instance per bucket: Each instance handles exactly one bucket; scale horizontally with multiple instances
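
The content-deduplication idea can be sketched with an in-memory toy (plain dicts stand in for S3 and the metadata store; this is an illustration of the technique, not s3dedup's actual code):

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: objects keyed by SHA256, paths are references."""

    def __init__(self):
        self.objects = {}  # sha256 hex digest -> content (the "S3" side)
        self.paths = {}    # logical path -> sha256 hex digest (the metadata side)

    def put(self, path, content):
        digest = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(digest, content)  # identical content stored once
        self.paths[path] = digest

    def get(self, path):
        return self.objects[self.paths[path]]

store = DedupStore()
store.put("a/report.txt", b"same bytes")
store.put("b/copy.txt", b"same bytes")
assert len(store.objects) == 1  # one object backs both paths
assert store.get("b/copy.txt") == b"same bytes"
```

Because paths are only references to a digest, deleting one path leaves the shared object intact until its last reference is gone, which is what makes the background cleaner necessary.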

Quick Start with Docker

Pull the image from GitHub Container Registry:

docker pull ghcr.io/sio2project/s3dedup:latest

Run with environment variables:

docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  -e S3_ENDPOINT=http://minio:9000 \
  -e S3_ACCESS_KEY=minioadmin \
  -e S3_SECRET_KEY=minioadmin \
  ghcr.io/sio2project/s3dedup:latest

Or use an environment file:

# Copy and customize .env.example
cp .env.example .env

# Run with env file
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest

Configuration

Environment Variables

Variable                 Default          Description
LOG_LEVEL                info             Logging level (trace, debug, info, warn, error)
LOG_JSON                 false            Enable JSON logging
BUCKET_NAME              default          Bucket name identifier
LISTEN_ADDRESS           0.0.0.0          Server bind address
LISTEN_PORT              8080             Server port
KVSTORAGE_TYPE           sqlite           KV storage backend (sqlite, postgres)
SQLITE_PATH              /app/data/kv.db  SQLite database path
SQLITE_MAX_CONNECTIONS   10               SQLite connection pool size
LOCKS_TYPE               memory           Lock manager backend (memory, postgres)
S3_ENDPOINT              (required)       S3/MinIO endpoint URL
S3_ACCESS_KEY            (required)       S3 access key
S3_SECRET_KEY            (required)       S3 secret key
S3_FORCE_PATH_STYLE      true             Use path-style S3 URLs
CLEANER_ENABLED          true             Enable background cleaner
CLEANER_INTERVAL         3600             Cleaner run interval (seconds)
CLEANER_BATCH_SIZE       1000             Cleaner batch size
CLEANER_MAX_DELETES      10000            Maximum deletions per cleaner run
FILETRACKER_URL          -                Old Filetracker URL for live migration (HTTP fallback)
FILETRACKER_V1_DIR       -                V1 Filetracker directory for filesystem-based migration
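
How the CLEANER_* settings interact can be pictured with a toy reference-counted sweep (in-memory dicts stand in for S3 and the metadata store; the function name and exact scan order are illustrative, not s3dedup's implementation):

```python
def clean(objects, refcounts, batch_size=1000, max_deletes=10000):
    """Delete up to max_deletes unreferenced objects, scanning in batches."""
    deleted = 0
    keys = list(objects)
    for i in range(0, len(keys), batch_size):        # CLEANER_BATCH_SIZE
        for key in keys[i:i + batch_size]:
            if deleted >= max_deletes:               # CLEANER_MAX_DELETES cap
                return deleted
            if refcounts.get(key, 0) == 0:           # no path references this object
                del objects[key]
                deleted += 1
    return deleted

objects = {"aaa": b"x", "bbb": b"y", "ccc": b"z"}
refcounts = {"aaa": 2, "bbb": 0}
assert clean(objects, refcounts) == 2  # "bbb" and "ccc" are unreferenced
assert set(objects) == {"aaa"}
```

The batch size bounds how much is scanned at once, while the deletion cap bounds how much work a single run (one CLEANER_INTERVAL tick) may do.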

PostgreSQL Configuration

For PostgreSQL KV storage, use:

KVSTORAGE_TYPE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password
POSTGRES_DB=s3dedup
POSTGRES_MAX_CONNECTIONS=10

Distributed Locking (PostgreSQL Advisory Locks)

For distributed locking across multiple instances in high-availability setups, enable PostgreSQL-based advisory locks:

LOCKS_TYPE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password
POSTGRES_DB=s3dedup
POSTGRES_MAX_CONNECTIONS=10

When to Use:

  • Single-instance deployments: Use default memory-based locking (LOCKS_TYPE=memory)
  • Multi-instance HA deployments: Use PostgreSQL-based locking for coordinated access

Note: PostgreSQL locks share the connection pool with KV storage. Ensure sufficient pool size for concurrent operations. See DEVELOPMENT.md for implementation details.
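
PostgreSQL advisory locks take a signed 64-bit key, so a common pattern (an assumption here, not necessarily s3dedup's exact scheme) is to derive that key from a hash of the file path:

```python
import hashlib
import struct

def advisory_lock_key(path: str) -> int:
    """Derive a signed 64-bit key for pg_advisory_lock from a file path."""
    digest = hashlib.sha256(path.encode()).digest()
    # First 8 bytes of the digest, interpreted as a big-endian signed int64.
    return struct.unpack(">q", digest[:8])[0]

# The key would then be used as: SELECT pg_advisory_lock(%s)
key = advisory_lock_key("problems/abc/in/test1.in")
assert -2**63 <= key < 2**63
assert key == advisory_lock_key("problems/abc/in/test1.in")  # deterministic
```

Because every instance derives the same key for the same path, all instances contend on the same database-side lock, which is what coordinates concurrent access in HA setups.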

Connection Pool Sizing

The POSTGRES_MAX_CONNECTIONS setting controls the maximum number of concurrent database connections. This pool is shared between KV storage operations and lock management.

Quick Start Recommendations:

  • Development: POSTGRES_MAX_CONNECTIONS=10
  • Small Production (1-3 instances): POSTGRES_MAX_CONNECTIONS=20-30
  • Large Production (5+ instances): POSTGRES_MAX_CONNECTIONS=50-100

For detailed pool sizing guidance, monitoring strategies, and tuning considerations, see DEVELOPMENT.md.
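
A quick sanity check when sizing the shared pool: the fleet's total demand must stay below the PostgreSQL server's own connection limit (the numbers below are illustrative; PostgreSQL's default max_connections is 100):

```python
def total_pool_demand(instances: int, pool_per_instance: int) -> int:
    """Connections an s3dedup fleet can open against one PostgreSQL server."""
    return instances * pool_per_instance

# 3 HA instances, each with POSTGRES_MAX_CONNECTIONS=30:
demand = total_pool_demand(3, 30)
assert demand == 90  # leaves little headroom under a default max_connections of 100
```

Remember that KV storage and lock management draw from the same per-instance pool, so size it for the sum of both workloads.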

Config File

Alternatively, use a JSON config file:

docker run -d \
  -p 8080:8080 \
  -v $(pwd)/config.json:/app/config.json \
  -v s3dedup-data:/app/data \
  ghcr.io/sio2project/s3dedup:latest \
  server --config /app/config.json

Environment variables override config file values.

Deployment and Scaling

Single-Instance per Bucket Architecture

s3dedup follows a single-bucket-per-instance design pattern, consistent with 12-factor application principles:

  • One Instance = One Bucket: Each s3dedup instance manages exactly one S3 bucket and serves one Filetracker endpoint
  • Horizontal Scaling: For multiple buckets, run multiple s3dedup instances (one per bucket)
  • Simplified Configuration: Simpler config files, easier to reason about, and a better fit for container orchestration

High-Availability Deployments

For a single bucket with high availability, run multiple instances with PostgreSQL locks and shared database:

# All instances share the same PostgreSQL database and use PostgreSQL locks
docker run -d \
  --name s3dedup-ha-1 \
  -p 8001:8080 \
  -e BUCKET_NAME=files \
  -e LISTEN_PORT=8080 \
  -e KVSTORAGE_TYPE=postgres \
  -e LOCKS_TYPE=postgres \
  -e POSTGRES_HOST=postgres-db \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=s3dedup \
  -e S3_ENDPOINT=http://minio:9000 \
  -e S3_ACCESS_KEY=minioadmin \
  -e S3_SECRET_KEY=minioadmin \
  ghcr.io/sio2project/s3dedup:latest server --env

# Repeat for instances 2, 3, etc., on different ports

Benefits of HA Setup:

  • Load Balancing: Requests can be distributed across multiple instances
  • Fault Tolerance: If one instance fails, others continue serving requests
  • Coordinated Access: PostgreSQL locks ensure safe concurrent file operations
  • Shared Metadata: Single database prevents data inconsistency

Migration

📖 Complete Migration Guide: See docs/migration.md for comprehensive migration instructions

s3dedup supports migration from both Filetracker V1 (filesystem-based) and V2 (HTTP-based) servers.

V2 Migration (Filetracker 2.1+)

Offline Migration

Migrate all files from Filetracker V2 via HTTP while the proxy is offline:

docker run --rm \
  --env-file .env \
  -v s3dedup-data:/app/data \
  ghcr.io/sio2project/s3dedup:latest \
  migrate --env \
  --filetracker-url http://old-filetracker:8000 \
  --max-concurrency 10

Live Migration (Zero Downtime)

Run the proxy while migrating in the background:

# Set FILETRACKER_URL in your .env file
echo "FILETRACKER_URL=http://old-filetracker:8000" >> .env

# Start in live migration mode
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest \
  live-migrate --env --max-concurrency 10

During V2 live migration:

  • GET: Falls back to old Filetracker if file not found, migrates on-the-fly
  • PUT: Writes to both s3dedup and old Filetracker
  • DELETE: Deletes from both systems
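
This fallback behaviour can be sketched with in-memory stand-ins (plain dicts replace the real HTTP calls to both systems; class and attribute names are hypothetical):

```python
class LiveMigratingProxy:
    """Toy model of the GET/PUT/DELETE semantics during live migration."""

    def __init__(self, new_store, old_store):
        self.new = new_store  # the s3dedup side
        self.old = old_store  # the legacy Filetracker side

    def get(self, path):
        if path in self.new:
            return self.new[path]
        content = self.old[path]  # fall back to the old Filetracker
        self.new[path] = content  # migrate on the fly
        return content

    def put(self, path, content):
        self.new[path] = content  # write to both systems
        self.old[path] = content

    def delete(self, path):
        self.new.pop(path, None)  # delete from both systems
        self.old.pop(path, None)

proxy = LiveMigratingProxy({}, {"legacy.txt": b"old"})
assert proxy.get("legacy.txt") == b"old"
assert "legacy.txt" in proxy.new  # migrated on first access
```

Write-through on PUT and double deletion keep the two systems consistent, so the old server remains a valid fallback until migration completes.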

V1 Migration (Legacy Filetracker)

V1 Filetracker stores files directly on the filesystem and serves them via a simple HTTP protocol. The key difference from V2 is that V1 doesn't have a /list/ endpoint for file discovery, so migration uses filesystem walking.

Performance: V1 migration uses chunked processing to handle millions of files efficiently without loading all file paths into memory. The filesystem is scanned in chunks of 10,000 files, keeping memory usage constant regardless of total file count.
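
Chunked scanning of this sort can be sketched as a generator that never holds more than one chunk of paths in memory (the chunk size and os.walk usage here are illustrative, not s3dedup's actual Rust implementation):

```python
import os

def walk_in_chunks(root, chunk_size=10_000):
    """Yield lists of at most chunk_size file paths; memory stays O(chunk_size)."""
    chunk = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            chunk.append(os.path.join(dirpath, name))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final partial chunk

# Usage (migrate_files is a hypothetical per-chunk migration step):
# for chunk in walk_in_chunks("/filetracker", 10_000):
#     migrate_files(chunk)
```

Each chunk is fully processed before the next is built, so total file count never affects peak memory.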

Offline Migration

Migrate from V1 filesystem (requires access to $FILETRACKER_DIR):

docker run --rm \
  --env-file .env \
  -v s3dedup-data:/app/data \
  -v /path/to/filetracker:/filetracker:ro \
  ghcr.io/sio2project/s3dedup:latest \
  migrate-v1 --env \
  --v1-directory /filetracker \
  --max-concurrency 10

Live Migration

Run the proxy while migrating from V1 in the background:

# With both filesystem access and HTTP fallback
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  -v /path/to/filetracker:/filetracker:ro \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest \
  live-migrate-v1 --env \
  --v1-directory /filetracker \
  --filetracker-url http://old-filetracker-v1:8000 \
  --max-concurrency 10

# Or with HTTP fallback only (no filesystem access)
docker run -d \
  --name s3dedup \
  -p 8080:8080 \
  -v s3dedup-data:/app/data \
  --env-file .env \
  ghcr.io/sio2project/s3dedup:latest \
  live-migrate-v1 --env \
  --filetracker-url http://old-filetracker-v1:8000 \
  --max-concurrency 10

During V1 live migration:

  • Background filesystem migration: If --v1-directory is provided, filesystem is scanned in chunks to migrate all files
    • Chunked processing handles millions of files with constant memory usage
  • HTTP fallback: If --filetracker-url is provided, GET requests fall back to V1 server if file not found
    • Automatically migrates files on first access
  • New requests: Server accepts PUT/GET/DELETE requests normally during migration

For detailed migration strategies, performance tuning, troubleshooting, and rollback procedures, see the Migration Guide.

API Endpoints

Compatible with Filetracker protocol v2:

  • GET /ft/version - Get protocol version
  • GET /ft/list/{path} - List files
  • GET /ft/files/{path} - Download file
  • HEAD /ft/files/{path} - Get file metadata
  • PUT /ft/files/{path} - Upload file
  • DELETE /ft/files/{path} - Delete file
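
A client builds request URLs for these routes mechanically; a minimal helper might look like this (the helper itself is an illustration, only the route prefixes come from the list above):

```python
def ft_url(base: str, operation: str, path: str = "") -> str:
    """Build a Filetracker v2 URL for the routes listed above."""
    routes = {"version": "/ft/version", "list": "/ft/list/", "files": "/ft/files/"}
    return base.rstrip("/") + routes[operation] + path.lstrip("/")

assert ft_url("http://localhost:8080", "version") == "http://localhost:8080/ft/version"
assert ft_url("http://localhost:8080/", "files", "/a/b.txt") == "http://localhost:8080/ft/files/a/b.txt"
```

The same /ft/files/ URL serves GET, HEAD, PUT, and DELETE; only the HTTP method changes.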

Testing

For comprehensive testing guide, see DEVELOPMENT.md.

Quick start:

# Run unit tests (no external dependencies)
cargo test --lib

# Run all tests (requires PostgreSQL + MinIO)
docker-compose up -d
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/s3dedup_test"
cargo test
docker-compose down

Development

See DEVELOPMENT.md for detailed development instructions including:

  • Building from source
  • Running tests with different configurations
  • PostgreSQL advisory lock implementation details
  • Contributing guidelines
  • Performance considerations

Quick start:

# Run with Docker Compose (includes PostgreSQL + MinIO)
docker-compose up

# In another terminal, run tests
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/s3dedup_test"
cargo test

Architecture

  • API Layer: Axum-based HTTP server with Filetracker routes
  • Deduplication: SHA256-based content addressing
  • Storage Backend: S3-compatible object storage (MinIO, AWS S3, etc.)
  • Metadata Store: SQLite or PostgreSQL for file metadata and reference counts
  • Lock Manager: In-memory (single-instance) or PostgreSQL advisory locks (distributed, multi-instance HA)
    • Memory locks: Fast, suitable for single-instance deployments
    • PostgreSQL locks: Distributed coordination, suitable for multi-instance HA setups
  • Cleaner: Background worker that removes unreferenced S3 objects

For detailed architecture documentation, see Documentation.

License

See LICENSE file for details.
