GlassFlow ETL Kubernetes Operator

Enterprise-grade ETL pipeline orchestration for Kubernetes with seamless deduplication and joins

🚀 Quick Start • 📖 Documentation • 🏗️ Architecture • 🤝 Contributing


🎯 Overview

The GlassFlow ETL Kubernetes Operator is a production-ready Kubernetes operator that enables scalable, cloud-native data pipeline deployments. Built as a companion to the GlassFlow ClickHouse ETL project, it provides enterprise-grade data processing capabilities with advanced features like deduplication, temporal joins, and seamless pause/resume functionality.

✨ Key Features

  • 🔄 Pipeline Lifecycle Management - Create, pause, resume, and terminate data pipelines
  • 🎯 Advanced Deduplication - Built-in deduplication with configurable time windows
  • 🔗 Stream Joins - Seamless joining of multiple data streams
  • ⚡ Kubernetes Native - Full CRD-based pipeline management
  • 🛡️ Production Ready - Enterprise-grade reliability and monitoring
  • 📊 Scalable Ingestor - Efficiently reads from multiple Kafka partitions with horizontal scaling
  • 🔧 Helm Charts - Easy deployment and configuration management
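To make "deduplication with configurable time windows" concrete, here is a minimal Go sketch of the general technique. It is an illustration only and not the operator's actual implementation: the `Deduper` type, its methods, and the in-memory map are hypothetical, and timestamps are plain nanosecond counts.

```go
package main

import "fmt"

// Deduper is a hypothetical in-memory time-window deduplicator.
// It illustrates the idea only; it is not the operator's implementation.
type Deduper struct {
	windowNanos int64            // window length in nanoseconds
	lastSeen    map[string]int64 // key -> timestamp (ns) of the last accepted event
}

// NewDeduper returns a Deduper with the given window in nanoseconds.
func NewDeduper(windowNanos int64) *Deduper {
	return &Deduper{windowNanos: windowNanos, lastSeen: make(map[string]int64)}
}

// Accept reports whether the event should be kept. An event is dropped when
// the same key was accepted less than one window earlier.
func (d *Deduper) Accept(key string, tsNanos int64) bool {
	if last, ok := d.lastSeen[key]; ok && tsNanos-last < d.windowNanos {
		return false
	}
	d.lastSeen[key] = tsNanos
	return true
}

func main() {
	const second = int64(1_000_000_000)
	d := NewDeduper(60 * second)
	fmt.Println(d.Accept("user-1", 0))         // true: first occurrence
	fmt.Println(d.Accept("user-1", 30*second)) // false: within the window
	fmt.Println(d.Accept("user-1", 90*second)) // true: window has elapsed
	fmt.Println(d.Accept("user-2", 30*second)) // true: different key
}
```

A real deployment would also need to expire old keys and persist state across restarts; this sketch leaves both out for brevity.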

🏗️ Architecture

graph LR
    KAFKA[Kafka Cluster]
    
    subgraph "Kubernetes Cluster"
        subgraph "GlassFlow ETL"
            subgraph "Operator"
                OP[Operator Controller]
                CRD[Pipeline CRD]
            end
            
            subgraph "Data Pipeline"
                ING[Ingestor Pods]
                JOIN[Join Pod]
                SINK[Sink Pod]
            end
            
            subgraph NATS_JETSTREAM["NATS JetStream"]
                NATS[NATS]
                DLQ[DLQ]
            end
        end
    end
    
    CH[ClickHouse]
    
    subgraph "External"
        API[GlassFlow API]
        UI[Web UI]
    end
    
    API --> CRD
    CRD --> OP
    OP --> ING
    OP --> JOIN
    OP --> SINK
    
    ING <--> NATS_JETSTREAM
    JOIN <--> NATS_JETSTREAM
    SINK <--> NATS_JETSTREAM
    
    KAFKA --> ING
    SINK --> CH
    
    UI --> API
    

🚀 Quick Start

Prerequisites

  • Kubernetes 1.19+ cluster
  • Helm 3.2.0+
  • kubectl configured for your cluster
  • Kafka (optional - can use external setup for development)
  • ClickHouse (optional - can use external setup for development)

Option 1: Helm Chart (Recommended)

Deploy using the complete GlassFlow ETL stack from the GlassFlow Charts repository:

# Add GlassFlow Helm repository
helm repo add glassflow https://glassflow.github.io/charts
helm repo update

# Install complete GlassFlow ETL stack
helm install glassflow-etl glassflow/glassflow-etl

Option 2: Operator Only

Deploy just the operator as a dependency:

# Install operator chart
helm install glassflow-operator glassflow/glassflow-operator

Uninstalling the Operator

The operator includes automatic cleanup functionality that ensures all pipelines are immediately terminated when uninstalling:

helm uninstall glassflow-operator

This will:

  • ✅ Terminate all existing pipelines ungracefully
  • ✅ Clean up all resources (namespaces, deployments, NATS streams)
  • ✅ Remove Pipeline CRD definitions
  • ✅ Remove the operator deployment

For more details, see HELM_UNINSTALL.md.

Option 3: Manual Installation

# Clone the repository
git clone https://github.com/glassflow/glassflow-etl-k8s-operator.git
cd glassflow-etl-k8s-operator

# Install CRDs
make install

# Deploy operator
make deploy IMG=ghcr.io/glassflow/glassflow-etl-k8s-operator:latest

📖 Documentation

Pipeline Management

Create pipelines using the GlassFlow ClickHouse ETL backend API. The operator will automatically create the corresponding Pipeline custom resources. Here's an example of what a generated resource looks like:

apiVersion: etl.glassflow.io/v1alpha1
kind: Pipeline
metadata:
  name: user-events-pipeline
spec:
  pipeline_id: "user-events-v1"
  config: "pipeline-config"
  dlq: "dead-letter-queue"
  sources:
    type: kafka
    topics:
      - topic_name: "user-events"
        stream: "users"
        dedup_window: 60000000000  # 1 minute in nanoseconds
  join:
    type: "temporal"
    stream: "joined-users"
    enabled: true
  sink: "clickhouse"
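The `dedup_window` value in the example is given in nanoseconds, which is also the unit of Go's `time.Duration`. The short sketch below shows how such a value converts to a readable duration and back. The helper names `windowFromNanos` and `windowSeconds` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// windowFromNanos converts a nanosecond count, as used by dedup_window in the
// example above, into a time.Duration. (Hypothetical helper.)
func windowFromNanos(nanos int64) time.Duration {
	return time.Duration(nanos)
}

// windowSeconds returns the same window expressed in seconds.
func windowSeconds(nanos int64) float64 {
	return windowFromNanos(nanos).Seconds()
}

func main() {
	w := windowFromNanos(60000000000)
	fmt.Println(w)                // 1m0s
	fmt.Println(w == time.Minute) // true

	// Going the other way: the nanosecond value for a 5-minute window.
	fmt.Println(int64(5 * time.Minute)) // 300000000000
}
```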

Current Capabilities

| Feature | Status | Description |
|---|---|---|
| Pipeline Creation | ✅ | Deploy new ETL pipelines via CRD |
| Pipeline Termination | ✅ | Graceful shutdown and cleanup |
| Pipeline Pausing | ✅ | Temporarily halt data processing |
| Pipeline Resuming | ✅ | Resume paused pipelines |
| Deduplication | ✅ | Configurable time-window deduplication |
| Stream Joins | ✅ | Multi-stream data joining |
| Auto-scaling | ✅ | Horizontal pod autoscaling / ingestor replicas support |
| Monitoring | ✅ | Prometheus metrics integration |
| Helm Uninstall Cleanup | ✅ | Automatic pipeline termination and CRD cleanup on uninstall |
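The lifecycle operations listed above (pause, resume, terminate) can be read as a small state machine. The Go sketch below is one possible model under stated assumptions: the state names, action names, and allowed transitions are hypothetical and chosen for illustration, so the operator's real states and rules may differ.

```go
package main

import "fmt"

// State and Action are hypothetical names for this illustration.
type State string
type Action string

const (
	Running    State = "Running"
	Paused     State = "Paused"
	Terminated State = "Terminated"

	Pause     Action = "pause"
	Resume    Action = "resume"
	Terminate Action = "terminate"
)

// transitions lists the allowed (state, action) pairs and the resulting state.
// Terminated has no entry, so no action is accepted once a pipeline is terminated.
var transitions = map[State]map[Action]State{
	Running: {Pause: Paused, Terminate: Terminated},
	Paused:  {Resume: Running, Terminate: Terminated},
}

// Next returns the state reached by applying action a in state s,
// or an error if that transition is not allowed.
func Next(s State, a Action) (State, error) {
	if next, ok := transitions[s][a]; ok {
		return next, nil
	}
	return s, fmt.Errorf("cannot %s a pipeline in state %s", a, s)
}

func main() {
	s := Running
	for _, a := range []Action{Pause, Resume, Terminate, Resume} {
		next, err := Next(s, a)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", s, next)
		s = next
	}
}
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to see which operations are rejected, for example resuming a pipeline that has already been terminated.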

🛠️ Development Setup

Prerequisites

  • Go 1.23+
  • Docker 17.03+
  • kubectl v1.11.3+
  • Kind (for local testing)
  • NATS (for messaging)

Local Development

  1. Clone and setup:

    git clone https://github.com/glassflow/glassflow-etl-k8s-operator.git
    cd glassflow-etl-k8s-operator
    make help  # See all available targets
  2. Install dependencies:

    # Install development tools
    make controller-gen
    make kustomize
    make golangci-lint
  3. Start local infrastructure:

    # Start NATS with JetStream (must run inside the cluster)
    helm repo add nats https://nats-io.github.io/k8s/helm/charts/
    helm install nats nats/nats --set nats.jetstream.enabled=true
    
    # Start Kafka (using Helm)
    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm install kafka bitnami/kafka
    
    # Start ClickHouse (using Helm)
    helm install clickhouse bitnami/clickhouse
    
    # Or use external Kafka/ClickHouse for development
  4. Run the operator:

    # Run locally (requires NATS running inside the cluster)
    make run

Project Structure

This project was built using Kubebuilder v4 and follows Kubernetes operator best practices:

├── api/v1alpha1/          # CRD definitions
├── internal/controller/   # Operator controller logic
├── internal/nats/         # NATS client integration
├── charts/                # Helm charts
├── config/                # Kustomize configurations
└── test/                  # Unit and e2e tests

Development Tools

  • Kubebuilder - Operator framework and scaffolding
  • Kustomize - Kubernetes configuration management
  • Helmify - Automatic Helm chart generation
  • GolangCI-Lint - Code quality and linting

Testing

# Run e2e tests (requires Kind cluster) - Primary testing method
make test-e2e

# Run unit tests (coverage being improved)
make test

# Run linter
make lint

📊 Chart Comparison

| Chart | Purpose | Components | Use Case |
|---|---|---|---|
| glassflow-etl | Complete ETL Platform | UI, API, Operator, NATS | Full-featured deployment |
| glassflow-operator | Operator Only | Operator, CRDs | Dependency for custom setups |

The glassflow-etl chart includes the complete platform with web UI, backend API, NATS, and the operator as dependencies. The glassflow-operator chart is designed as a dependency for the main chart or custom deployments.

🔗 Related Projects

🎥 Demo & Resources

🤝 Contributing

We welcome contributions!

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: make test
  5. Run linter: make lint
  6. Submit a pull request

📄 License

This project is licensed under the Apache License 2.0 - see the clickhouse-etl LICENSE file for details.

🆘 Support


Built by GlassFlow Team

Website • Documentation • GitHub
