Go Pipeline v2

English | δΈ­ζ–‡ | πŸ“– Documentation

Go Tests Go Report Card GoDoc Latest Release License

A high-performance Go batch processing pipeline framework with generics support and concurrency safety, providing both standard and deduplication batch processing modes.

πŸ“‹ System Requirements

  • Go 1.18+ (with generics support)
  • Supports Linux, macOS, Windows

πŸ“¦ Installation

go get github.com/rushairer/go-pipeline/v2@latest

πŸš€ Features

  • Generics Support: Type-safe implementation based on Go 1.18+ generics
  • Batch Processing: Automatic batching by size and time intervals
  • Concurrency Safety: Built-in goroutine safety mechanisms
  • Flexible Configuration: Customizable buffer size, batch size, and flush intervals
  • Error Handling: Comprehensive error handling and propagation mechanisms
  • Two Modes: Standard batch processing and deduplication batch processing
  • Sync/Async: Support for both synchronous and asynchronous execution modes
  • Go Conventions: Follows "writer closes" channel management principle

πŸ“ Project Structure

v2/
β”œβ”€β”€ config.go                           # Configuration definitions
β”œβ”€β”€ errors.go                           # Error definitions
β”œβ”€β”€ interface.go                        # Interface definitions
β”œβ”€β”€ pipeline_impl.go                    # Common pipeline implementation
β”œβ”€β”€ pipeline_standard.go                # Standard pipeline implementation
β”œβ”€β”€ pipeline_deduplication.go           # Deduplication pipeline implementation
β”œβ”€β”€ pipeline_standard_test.go           # Standard pipeline unit tests
β”œβ”€β”€ pipeline_standard_benchmark_test.go # Standard pipeline benchmark tests
β”œβ”€β”€ pipeline_deduplication_test.go      # Deduplication pipeline unit tests
β”œβ”€β”€ pipeline_deduplication_benchmark_test.go # Deduplication pipeline benchmark tests
└── pipeline_performance_benchmark_test.go # Performance benchmark tests

πŸ“¦ Core Components

Interface Definitions

  • PipelineChannel[T]: Defines pipeline channel access interface
  • Performer: Defines pipeline execution interface
  • DataProcessor[T]: Defines core batch data processing interface
  • Pipeline[T]: Combines all pipeline functionality into a universal interface

Implementation Types

  • StandardPipeline[T]: Standard batch processing pipeline, processes data sequentially in batches
  • DeduplicationPipeline[T]: Deduplication batch processing pipeline, deduplicates based on unique keys
  • PipelineImpl[T]: Common pipeline implementation providing basic functionality

πŸ—οΈ Architecture Design

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Data Input    │───▢│   Buffer Channel │───▢│  Batch Processorβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚                        β”‚
                                β–Ό                        β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚   Timer Ticker   β”‚    β”‚   Flush Handler β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚                        β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                         β–Ό
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                β”‚  Error Channel  β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”„ Data Flow Diagrams

Standard Pipeline Flow

graph TD
    A[Data Input] --> B[Add to Buffer Channel]
    B --> C{Batch Full?}
    C -->|Yes| D[Execute Batch Processing]
    C -->|No| E[Wait for More Data]
    E --> F{Timer Triggered?}
    F -->|Yes| G{Batch Empty?}
    G -->|No| D
    G -->|Yes| E
    F -->|No| E
    D --> H[Call Flush Function]
    H --> I{Error Occurred?}
    I -->|Yes| J[Send to Error Channel]
    I -->|No| K[Reset Batch]
    J --> K
    K --> E

Deduplication Pipeline Flow

graph TD
    A[Data Input] --> B[Get Unique Key]
    B --> C[Add to Map Container]
    C --> D{Batch Full?}
    D -->|Yes| E[Execute Deduplication Batch Processing]
    D -->|No| F[Wait for More Data]
    F --> G{Timer Triggered?}
    G -->|Yes| H{Batch Empty?}
    H -->|No| E
    H -->|Yes| F
    G -->|No| F
    E --> I[Call Deduplication Flush Function]
    I --> J{Error Occurred?}
    J -->|Yes| K[Send to Error Channel]
    J -->|No| L[Reset Batch]
    K --> L
    L --> F

Test File Descriptions

The project includes a complete test suite to ensure code quality and performance:

  • pipeline_standard_test.go: Unit tests for standard pipeline, verifying basic functionality
  • pipeline_deduplication_test.go: Unit tests for deduplication pipeline, verifying deduplication logic
  • pipeline_standard_benchmark_test.go: Performance benchmark tests for standard pipeline
  • pipeline_deduplication_benchmark_test.go: Performance benchmark tests for deduplication pipeline
  • pipeline_performance_benchmark_test.go: Comprehensive performance benchmark tests

πŸ“‹ Configuration Parameters

type PipelineConfig struct {
    BufferSize       uint32        // Buffer channel capacity (default: 100)
    FlushSize        uint32        // Maximum batch data capacity (default: 50)
    FlushInterval    time.Duration // Timed flush interval (default: 50ms)
    DrainOnCancel    bool          // Whether to best-effort flush on cancellation (default false)
    DrainGracePeriod time.Duration // Max window for the final flush when DrainOnCancel is true
}

🎯 Performance-Optimized Default Values

Based on performance benchmark tests, v2 adopts an optimized default configuration:

  • BufferSize: 100 - Buffer size, should be >= FlushSize * 2 to avoid blocking
  • FlushSize: 50 - Batch size, performance tests show around 50 is optimal
  • FlushInterval: 50ms - Flush interval, balances latency and throughput

FlushSize vs BufferSize: Relationship and Tuning

  • Roles:
    • FlushSize: batch size threshold; reaching it triggers a flush (or by FlushInterval).
    • BufferSize: capacity of input channel; determines how much can be queued without blocking producers.
  • Recommended relation: BufferSize β‰₯ k Γ— FlushSize, where k ∈ [4, 10] for stable throughput under bursts.
  • Effects of size relation:
    • BufferSize < FlushSize: frequent timeout-based small batches, lower throughput, higher latency/GC.
    • BufferSize β‰ˆ 2Γ—FlushSize: generally OK, but sensitive to bursty producers.
    • BufferSize β‰₯ 4–10Γ—FlushSize: higher full-batch ratio, better throughput, fewer producer stalls (uses more memory).
  • Coordination with FlushInterval:
    • FlushInterval bounds tail latency when a batch isn't filled in time.
    • Too-small BufferSize shifts more flushes to timeout path, shrinking effective batch size.

Sizing recipe based on processing cost:

  • Measure in your flush function:
    • t_item: average per-item processing time (ns/item).
    • t_batch: fixed per-batch overhead (ns/batch), e.g., DB round-trip, encoding, etc.
  • Choose amortization target Ξ± (e.g., 0.1 means per-batch overhead per item ≀ 10% of per-item cost).
  • Then:
    • FlushSize β‰₯ ceil(t_batch / (Ξ± Γ— t_item)) // clamp to [32, 128] as a practical range; default 50
    • BufferSize = k Γ— FlushSize, with k in [4, 10] depending on burstiness and number of producers.
  • Example:
    • t_item = 2Β΅s, t_batch = 200Β΅s, Ξ± = 0.1 β‡’ FlushSize β‰₯ 200 / (0.1Γ—2) = 1000 β‡’ clamp to 128 if latency-sensitive, or keep 1000 if purely throughput-focused; then BufferSize = 4–10 Γ— FlushSize.

Recommended defaults (balanced latency/throughput):

  • FlushSize: 50
  • BufferSize: 100 (β‰ˆ 2Γ—FlushSize; increase to 4–10Γ— under multi-producer bursts)
  • FlushInterval: 50ms

Cheat Sheet

Quick picks:

  • High throughput:
    • FlushSize: 64–128 (default 50 is balanced; increase if pure throughput)
    • BufferSize: 4–10 Γ— FlushSize (higher with more producers/bursts)
    • FlushInterval: 50–100ms
  • Low latency:
    • FlushSize: 8–32
    • BufferSize: β‰₯ 4 Γ— FlushSize
    • FlushInterval: 1–10ms (cap tail latency)
  • Memory constrained:
    • FlushSize: 16–32
    • BufferSize: 2–4 Γ— FlushSize (bounded by memory budget)
    • FlushInterval: 50–200ms
  • Multiple producers (N):
    • Suggest: BufferSize β‰₯ (4–10) Γ— FlushSize Γ— ceil(N / NumCPU)
    • Goal: maintain high full-batch ratio and reduce producer stalls under bursts

Formulas:

  • FlushSize β‰ˆ clamp( ceil(t_batch / (Ξ± Γ— t_item)), 32, 128 )
    • t_item: avg per-item cost
    • t_batch: fixed per-batch overhead (e.g., DB round-trip)
    • Ξ±: amortization target (e.g., 0.1 β‡’ per-batch overhead per item ≀ 10% of t_item)
  • BufferSize = k Γ— FlushSize, k ∈ [4, 10] based on concurrency/burstiness
  • FlushInterval:
    • latency-bound: FlushInterval β‰ˆ target tail-latency budget
    • rate-bound: FlushInterval β‰ˆ p99 inter-arrival Γ— FlushSize

Validation checklist:

  • Full-batch ratio β‰₯ 80%
  • Producer stall rate near 0
  • Stable GC/memory watermark
  • P99/P99.9 E2E latency within SLO

Tuning Helper (Go)

Use two measurements (N1, N2) to estimate t_item and t_batch, then compute recommended FlushSize/BufferSize:

package main

import (
	"context"
	"fmt"
	"math"
	"time"

	gopipeline "github.com/rushairer/go-pipeline/v2"
)

// Replace this with your real batch function you're planning to use.
func batchFunc(ctx context.Context, items []int) error {
	// Simulate work: per-batch overhead + per-item cost.
	// Replace with actual logic. Keep it side-effect free for measurement.
	time.Sleep(200 * time.Microsecond) // t_batch (example)
	perItem := 2 * time.Microsecond    // t_item (example)
	time.Sleep(time.Duration(len(items)) * perItem)
	return nil
}

func measureOnce(n int, rounds int) time.Duration {
	items := make([]int, n)
	var total time.Duration
	ctx := context.Background()
	for i := 0; i < rounds; i++ {
		start := time.Now()
		_ = batchFunc(ctx, items)
		total += time.Since(start)
	}
	return total / time.Duration(rounds)
}

func estimateCosts(n1, n2, rounds int) (tItem, tBatch time.Duration) {
	d1 := measureOnce(n1, rounds)
	d2 := measureOnce(n2, rounds)
	// Linear fit: d = t_batch + n * t_item
	// t_item = (d2 - d1) / (n2 - n1)
	// t_batch = d1 - n1 * t_item
	tItem = time.Duration(int64((d2 - d1) / time.Duration(n2-n1)))
	tBatch = d1 - time.Duration(n1)*tItem
	if tItem < 0 {
		tItem = 0
	}
	if tBatch < 0 {
		tBatch = 0
	}
	return
}

func recommend(tItem, tBatch time.Duration, alpha float64, k int) (flush uint32, buffer uint32) {
	// FlushSize >= ceil(t_batch / (alpha * t_item)), clamped to [32, 128] as a practical default range
	if tItem <= 0 || alpha <= 0 {
		return 50, 100 // safe defaults
	}
	raw := float64(tBatch) / (alpha * float64(tItem))
	fs := int(math.Ceil(raw))
	if fs < 32 {
		fs = 32
	}
	if fs > 128 {
		// If you're purely throughput-focused, you may keep fs > 128.
		// For balanced latency, clamp to 128.
		fs = 128
	}
	if k < 1 {
		k = 4
	}
	return uint32(fs), uint32(k * fs)
}

func main() {
	// Example measurement with two points
	n1, n2 := 64, 512
	rounds := 20
	tItem, tBatch := estimateCosts(n1, n2, rounds)
	flush, buffer := recommend(tItem, tBatch, 0.1, 8)

	fmt.Printf("Estimated t_item=%v, t_batch=%v\n", tItem, tBatch)
	fmt.Printf("Recommended FlushSize=%d, BufferSize=%d (k=8, Ξ±=0.1)\n", flush, buffer)

	// Example of using recommended config
	_ = gopipeline.NewStandardPipeline[int](gopipeline.PipelineConfig{
		BufferSize:    buffer,
		FlushSize:     flush,
		FlushInterval: 50 * time.Millisecond,
	}, func(ctx context.Context, batch []int) error {
		return batchFunc(ctx, batch)
	})
}

Notes:

  • Replace batchFunc with your real processing. Keep external side effects minimal to reduce noise while measuring.
  • If your measured FlushSize exceeds 128 and you care about latency, clamp to 128; otherwise keep the larger value and increase BufferSize proportionally (k Γ— FlushSize).
  • Re-run measurements on different machines/workloads; cache, IO and network drastically affect t_batch.

Deduplication Pipeline Tuning

Key points:

  • Effective batch size: After dedup, the actual batch size ≀ FlushSize. If input has high duplication, the effective batch size may be much smaller than FlushSize.
  • Apply the same sizing recipe, but consider uniqueness ratio u ∈ (0,1]:
    • If pre-dedup batch has N items and u fraction are unique, effective items β‰ˆ u Γ— N.
    • When computing FlushSize by cost, you may need larger pre-dedup FlushSize so that u Γ— FlushSize β‰ˆ your target effective batch (e.g., ~50).
  • Buffer and interval:
    • BufferSize: still set BufferSize β‰₯ k Γ— FlushSize with k in [4,10] to absorb bursts.
    • FlushInterval: with high duplication, slightly increasing FlushInterval can help accumulate enough unique items to reach target effective batch; balance with latency SLO.
  • Memory note:
    • Dedup uses a map for the current batch. Map entries add overhead per unique key; prefer reusing value buffers in your flush function to reduce allocations.

Example with duplication:

  • Suppose t_item = 2Β΅s, t_batch = 200Β΅s, Ξ± = 0.1 β‡’ cost-based FlushSize_raw = 1000.
  • If uniqueness ratio u β‰ˆ 0.2, effective batch at FlushSize_raw β‰ˆ 200. If you want β‰ˆ 50 effective:
    • You can clamp FlushSize to 256–512 for latency balance, since u Γ— 256 β‰ˆ 51, or keep larger for throughput.
    • Set BufferSize = 8 Γ— FlushSize to handle bursts.

Multi-producer Sizing Example

For N producers on P logical CPUs:

  • Rule of thumb: BufferSize β‰₯ (4–10) Γ— FlushSize Γ— ceil(N / P).
  • Purpose: keep the consumer flushing with full batches while minimizing producer stalls during bursts.

Numerical example:

  • P=8 CPUs, N=16 producers, target FlushSize=64, choose k=6:
    • BufferSize β‰₯ 6 Γ— 64 Γ— ceil(16/8) = 6 Γ— 64 Γ— 2 = 768 (round to 1024 for headroom).
    • If uniqueness ratio u=0.5 in dedup mode and you need ~64 effective per flush, set FlushSizeβ‰ˆ128, then recompute BufferSize.

Cancellation (ctx.Done) Drain Options

Two optional knobs to balance correctness vs. immediacy on cancellation:

  • DrainOnCancel (bool, default: false):
    • false: cancel means immediate stop (no final flush)
    • true: on cancel, perform a best-effort final flush for the current partial batch within a bounded window
  • DrainGracePeriod (time.Duration):
    • Max time window for the best-effort flush when DrainOnCancel is true (default internal fallback: ~100ms if unset)

Recommended usage:

  • Normal shutdown (preserve data): close the data channel; the pipeline guarantees a final flush of remaining data and exits.
  • Forceful stop: cancel the context with DrainOnCancel=false.
  • Graceful cancel with minimal loss: set DrainOnCancel=true and configure a reasonable DrainGracePeriod (e.g., 50–200ms), noting the flush function should not ignore the new context.

Configuration with Default Values

You can use the NewPipelineConfig() function to create a configuration with default values, then customize specific parameters:

// Create configuration with default values
config := gopipeline.NewPipelineConfig()

// Use default values directly
pipeline := gopipeline.NewStandardPipeline(config, flushFunc)

// Or customize specific parameters using chain methods
config = gopipeline.NewPipelineConfig().
    WithFlushInterval(time.Millisecond * 10).
    WithBufferSize(200)

pipeline = gopipeline.NewStandardPipeline(config, flushFunc)

Available configuration methods:

  • NewPipelineConfig() - Create configuration with default values
  • WithBufferSize(size uint32) - Set buffer size
  • WithFlushSize(size uint32) - Set batch size
  • WithFlushInterval(interval time.Duration) - Set flush interval
  • WithDrainOnCancel(enabled bool) - Enable best-effort final flush on cancel
  • WithDrainGracePeriod(d time.Duration) - Set max window for the final flush when DrainOnCancel is enabled

πŸ’‘ Usage Examples

Standard Pipeline Example

package main

import (
    "context"
    "fmt"
    "log"
    "time"
    
    gopipeline "github.com/rushairer/go-pipeline/v2"
)

func main() {
    // Create standard pipeline
    pipeline := gopipeline.NewDefaultStandardPipeline(
        func(ctx context.Context, batchData []int) error {
            fmt.Printf("Processing batch data: %v\n", batchData)
            // Here you can perform database writes, API calls, etc.
            return nil
        },
    )
    
    ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
    defer cancel()
    
    // Start async processing
    go func() {
        if err := pipeline.AsyncPerform(ctx); err != nil {
            log.Printf("Pipeline execution error: %v", err)
        }
    }()
    
    // Listen for errors (consuming the error channel is recommended)
    errorChan := pipeline.ErrorChan(10) // Specify error channel buffer size
    go func() {
        for {
            select {
            case err, ok := <-errorChan:
                if !ok {
                    return
                }
                log.Printf("Batch processing error: %v", err)
            case <-ctx.Done():
                return
            }
        }
    }()
    
    // Use new DataChan API to send data
    dataChan := pipeline.DataChan()
    go func() {
        defer close(dataChan) // User controls channel closure
        for i := 0; i < 100; i++ {
            select {
            case dataChan <- i:
            case <-ctx.Done():
                return
            }
        }
    }()
    
    time.Sleep(time.Second * 2) // Wait for processing to complete
}

Deduplication Pipeline Example

package main

import (
    "context"
    "fmt"
    "log"
    "time"
    
    gopipeline "github.com/rushairer/go-pipeline/v2"
)

// Data structure implementing UniqueKeyData interface
type User struct {
    ID   string
    Name string
}

func (u User) GetKey() string {
    return u.ID
}

func main() {
    // Create deduplication pipeline
    pipeline := gopipeline.NewDefaultDeduplicationPipeline(
        func(ctx context.Context, batchData map[string]User) error {
            fmt.Printf("Processing deduplicated user data: %d users\n", len(batchData))
            for key, user := range batchData {
                fmt.Printf("  %s: %s\n", key, user.Name)
            }
            return nil
        },
    )
    
    ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
    defer cancel()
    
    // Start async processing
    go func() {
        if err := pipeline.AsyncPerform(ctx); err != nil {
            log.Printf("Pipeline execution error: %v", err)
        }
    }()
    
    // Listen for errors
    errorChan := pipeline.ErrorChan(10)
    go func() {
        for {
            select {
            case err, ok := <-errorChan:
                if !ok {
                    return
                }
                log.Printf("Batch processing error: %v", err)
            case <-ctx.Done():
                return
            }
        }
    }()
    
    // Use new DataChan API to send data
    dataChan := pipeline.DataChan()
    go func() {
        defer close(dataChan)
        
        users := []User{
            {ID: "1", Name: "Alice"},
            {ID: "2", Name: "Bob"},
            {ID: "1", Name: "Alice Updated"}, // Will overwrite the first Alice
            {ID: "3", Name: "Charlie"},
            {ID: "2", Name: "Bob Updated"},   // Will overwrite the first Bob
        }
        
        for _, user := range users {
            select {
            case dataChan <- user:
            case <-ctx.Done():
                return
            }
        }
    }()
    
    time.Sleep(time.Second * 2) // Wait for processing to complete
}

Custom Configuration Example

// Create pipeline with custom configuration
config := gopipeline.PipelineConfig{
    BufferSize:    100,                   // Recommended balanced default
    FlushSize:     50,                    // Recommended balanced default
    FlushInterval: 50 * time.Millisecond, // Recommended balanced default
}

pipeline := gopipeline.NewStandardPipeline(config, 
    func(ctx context.Context, batchData []string) error {
        // Custom processing logic
        return nil
    },
)

Cancellation Drain Example

Two ways to finish a pipeline run:

  1. Close the data channel (recommended lossless drain)
config := gopipeline.NewPipelineConfig().
    WithBufferSize(100).
    WithFlushSize(50).
    WithFlushInterval(50 * time.Millisecond)
// DrainOnCancel is irrelevant here; closing the channel guarantees final flush.

p := gopipeline.NewStandardPipeline(config, func(ctx context.Context, batch []string) error {
    // Your processing
    return nil
})

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

go func() { _ = p.AsyncPerform(ctx) }()

dataChan := p.DataChan()
go func() {
    defer close(dataChan) // writer closes: guarantees a final flush of remaining items
    for i := 0; i < 1000; i++ {
        select {
        case dataChan <- fmt.Sprintf("item-%d", i):
        case <-ctx.Done():
            return
        }
    }
}()
  2. Cancel via context, with best-effort drain on cancel
config := gopipeline.NewPipelineConfig().
    WithBufferSize(100).
    WithFlushSize(50).
    WithFlushInterval(50 * time.Millisecond).
    WithDrainOnCancel(true).                // enable drain on cancel
    WithDrainGracePeriod(150 * time.Millisecond) // bound the drain time window

p := gopipeline.NewStandardPipeline(config, func(ctx context.Context, batch []string) error {
    // IMPORTANT: respect ctx; return promptly when ctx.Done() to honor grace window
    return nil
})

ctx, cancel := context.WithCancel(context.Background())
go func() { _ = p.AsyncPerform(ctx) }()

dataChan := p.DataChan()
// send some data...
// When you need to stop quickly but still try to flush current partial batch:
cancel() // pipeline will do one best-effort flush within DrainGracePeriod, then exit

Notes:

  • Close-the-channel path ensures remaining data is flushed regardless of ctx cancellation.
  • Drain-on-cancel is a compromise for fast stop with minimal loss; choose a small DrainGracePeriod (e.g., 50–200ms) and ensure your flush respects the provided context.

Deduplication Cancellation Drain Example

Two ways to finish a deduplication pipeline run:

  1. Close the data channel (recommended lossless drain)
config := gopipeline.NewPipelineConfig().
    WithBufferSize(100).
    WithFlushSize(50).
    WithFlushInterval(50 * time.Millisecond)

// Assumes a non-default constructor that accepts a config, mirroring NewStandardPipeline.
p := gopipeline.NewDeduplicationPipeline(config, func(ctx context.Context, batch map[string]User) error {
    // Your deduped processing
    return nil
})

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

go func() { _ = p.AsyncPerform(ctx) }()

ch := p.DataChan()
go func() {
    defer close(ch) // writer closes: guarantees a final flush (dedup map is flushed)
    for i := 0; i < 1000; i++ {
        select {
        case ch <- User{ID: fmt.Sprintf("%d", i%200), Name: "N"}: // include duplicates
        case <-ctx.Done():
            return
        }
    }
}()
  2. Cancel via context, with best-effort drain on cancel
config := gopipeline.NewPipelineConfig().
    WithBufferSize(100).
    WithFlushSize(50).
    WithFlushInterval(50 * time.Millisecond).
    WithDrainOnCancel(true).
    WithDrainGracePeriod(150 * time.Millisecond)

// Assumes a non-default constructor that accepts a config, so DrainOnCancel/DrainGracePeriod take effect.
p := gopipeline.NewDeduplicationPipeline(config, func(ctx context.Context, batch map[string]User) error {
    // IMPORTANT: respect ctx; return promptly to honor the grace window
    return nil
})

ctx, cancel := context.WithCancel(context.Background())
go func() { _ = p.AsyncPerform(ctx) }()

ch := p.DataChan()
// send some data...
cancel() // pipeline performs a best-effort flush of current dedup map within DrainGracePeriod, then exits

Notes:

  • Dedup mode keeps a map for the current batch; both shutdown strategies ensure the remaining unique entries are flushed.
  • For high-duplication inputs, consider a slightly longer FlushInterval to accumulate enough unique items, balanced with your latency SLO.

Shutdown semantics

The pipeline can exit via two distinct paths:

  • Channel closed:
    • If the current batch is non-empty, a final synchronous flush is performed with context.Background().
    • The loop returns nil (graceful shutdown).
  • Context canceled:
    • DrainOnCancel = false: return ErrContextIsClosed (no final flush).
    • DrainOnCancel = true: perform one best-effort final synchronous flush under a separate drainCtx with timeout (DrainGracePeriod, default ~100ms if unset). Returns errors.Join(ErrContextIsClosed, ErrContextDrained).

Detect exit conditions via errors.Is:

err := pipeline.AsyncPerform(ctx)
// ...
if errors.Is(err, gopipeline.ErrContextIsClosed) {
    // Exited due to context cancellation
}
if errors.Is(err, gopipeline.ErrContextDrained) {
    // A best-effort final drain flush was performed on cancel
}
// On channel-close path, err == nil (graceful shutdown)

Notes:

  • The final drain flush is executed synchronously to avoid races on shutdown.
  • Your flush function should respect the provided context (drainCtx) and return promptly.

🎯 Use Cases

1. Database Batch Inserts

// Batch insert database records
pipeline := gopipeline.NewDefaultStandardPipeline(
    func(ctx context.Context, records []DatabaseRecord) error {
        return db.BatchInsert(ctx, records)
    },
)

2. Log Batch Processing

// Batch write log files
pipeline := gopipeline.NewDefaultStandardPipeline(
    func(ctx context.Context, logs []LogEntry) error {
        return logWriter.WriteBatch(logs)
    },
)

3. API Batch Calls

// Batch call third-party APIs
pipeline := gopipeline.NewDefaultStandardPipeline(
    func(ctx context.Context, requests []APIRequest) error {
        return apiClient.BatchCall(ctx, requests)
    },
)

4. User Data Deduplication

// User data deduplication processing
pipeline := gopipeline.NewDefaultDeduplicationPipeline(
    func(ctx context.Context, users map[string]User) error {
        return userService.BatchUpdate(ctx, users)
    },
)

5. Message Queue Batch Consumption

// Batch process message queue data
pipeline := gopipeline.NewDefaultStandardPipeline(
    func(ctx context.Context, messages []Message) error {
        return messageProcessor.ProcessBatch(ctx, messages)
    },
)

πŸ”₯ Advanced Usage

Dynamic Configuration Adjustment

// Dynamically adjust configuration based on system load
func createAdaptivePipeline() *gopipeline.StandardPipeline[Task] {
    config := gopipeline.PipelineConfig{
        BufferSize:    getOptimalBufferSize(),
        FlushSize:     getOptimalFlushSize(),
        FlushInterval: getOptimalInterval(),
    }
    
    return gopipeline.NewStandardPipeline(config, processTaskBatch)
}

func getOptimalBufferSize() uint32 {
    // Calculate based on system memory and CPU cores
    return uint32(runtime.NumCPU() * 50)
}

func getOptimalFlushSize() uint32 {
    // Based on performance tests, around 50 is optimal
    return 50
}

func getOptimalInterval() time.Duration {
    // 50ms balances latency and throughput (see the defaults above)
    return 50 * time.Millisecond
}

Error Retry Mechanism

pipeline := gopipeline.NewDefaultStandardPipeline(
    func(ctx context.Context, batchData []Task) error {
        return retryWithBackoff(ctx, func() error {
            return processBatch(batchData)
        }, 3, time.Second)
    },
)

func retryWithBackoff(ctx context.Context, fn func() error, maxRetries int, baseDelay time.Duration) error {
    for i := 0; i < maxRetries; i++ {
        if err := fn(); err == nil {
            return nil
        }
        
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(baseDelay * time.Duration(1<<i)):
            // Exponential backoff
        }
    }
    return fmt.Errorf("max retries exceeded")
}

Monitoring and Metrics Collection

type MetricsPipeline struct {
    *gopipeline.StandardPipeline[Event]
    processedCount int64
    errorCount     int64
}

func NewMetricsPipeline() *MetricsPipeline {
    mp := &MetricsPipeline{}
    
    mp.StandardPipeline = gopipeline.NewDefaultStandardPipeline(
        func(ctx context.Context, events []Event) error {
            err := processEvents(events)
            
            atomic.AddInt64(&mp.processedCount, int64(len(events)))
            if err != nil {
                atomic.AddInt64(&mp.errorCount, 1)
            }
            
            return err
        },
    )
    
    return mp
}

func (mp *MetricsPipeline) GetMetrics() (processed, errors int64) {
    return atomic.LoadInt64(&mp.processedCount), atomic.LoadInt64(&mp.errorCount)
}

Graceful Shutdown

func gracefulShutdown(pipeline *gopipeline.StandardPipeline[Task]) {
    // Create context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    // Stop accepting new data
    // Close data channel
    dataChan := pipeline.DataChan()
    close(dataChan)
    
    // Wait for processing to complete
    done := make(chan struct{})
    go func() {
        defer close(done)
        // Wait for error channel to close, indicating all data has been processed
        errorChan := pipeline.ErrorChan(10)
        for {
            select {
            case err, ok := <-errorChan:
                if !ok {
                    return
                }
                log.Printf("Processing remaining error: %v", err)
            case <-ctx.Done():
                return
            }
        }
    }()
    
    // Wait for completion or timeout
    select {
    case <-done:
        log.Println("Pipeline graceful shutdown completed")
    case <-ctx.Done():
        log.Println("Pipeline shutdown timeout")
    }
}

⚑ Performance Characteristics

Based on the latest performance benchmark test results:

πŸš€ Core Performance Metrics

  • Data Processing Throughput: ~248 nanoseconds/item (Apple M4)
  • Memory Efficiency: 232 bytes/operation, 7 allocations/operation
  • Batch Processing Optimization: 5x performance improvement from batch size 1 to 50
  • Pipeline Overhead: About 38% slower than direct processing (225.4 vs 162.7 ns/op)

πŸ“Š Batch Size Performance Comparison

BatchSize1:   740.5 ns/op  (Slowest - frequent flushing)
BatchSize10:  251.5 ns/op  (Significant improvement)
BatchSize50:  146.5 ns/op  (Optimal performance) ⭐
BatchSize100: 163.4 ns/op  (Slight decline)
BatchSize500: 198.6 ns/op  (Batch too large)

πŸ’‘ Performance Optimization Recommendations

  1. Optimal Batch Size: Around 50
  2. Buffer Configuration: BufferSize >= FlushSize * 2
  3. Flush Interval: 50ms balances latency and throughput
  4. Async Mode: Recommended for better performance

⚠️ Important Notes

Error Channel Behavior: Lazily initialized via sync.Once. The first call to ErrorChan(size int) decides the buffer size; subsequent calls ignore size. Even if you don't explicitly call it, the pipeline will initialize it on first error send and write errors non-blockingly. If the channel isn't consumed and the buffer fills, subsequent errors are dropped (no blocking or panic).

Recommended to Listen to Error Channel: If you call ErrorChan(size int), it's recommended to listen to the error channel and use select statements to avoid infinite waiting.

Channel Management: v2 follows the "writer closes" principle; users control when to close DataChan().

⚠️ Pipeline Reuse Warning: If you need to reuse the same pipeline instance for multiple runs (calling SyncPerform() or AsyncPerform() multiple times), DO NOT close the DataChan prematurely. DataChan() returns the same channel instance, and once closed, it cannot be reused. Use context cancellation or timeout to control pipeline lifecycle instead.

πŸ”§ Best Practices

  1. Reasonable Batch Size: Based on performance tests, recommend using batch size around 50
  2. ⚠️ Must Listen to Error Channel: Use select statements to avoid blocking, handle errors from batch processing promptly
  3. Proper Channel Closure: Use defer close(dataChan) to ensure proper channel closure
  4. Context Management: Use context to control pipeline lifecycle
  5. Deduplication Key Design: Ensure uniqueness and stability of deduplication keys (see the key sketch after this list)
  6. Performance Tuning: Choose appropriate configuration parameters based on benchmark test results
  7. ⚠️ Pipeline Reuse: For repeated pipeline usage, avoid closing DataChan prematurely. Use context timeout/cancellation instead of channel closure to end processing
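
For point 5 above, a stable key is usually built from immutable identifying fields. A small sketch using a hypothetical Order type and the GetKey() method shown in the deduplication example:

type Order struct {
    CustomerID string
    OrderID    string
    Amount     float64
}

// GetKey combines immutable identifiers; mutable fields like Amount are
// deliberately excluded so the key stays stable across updates.
func (o Order) GetKey() string {
    return o.CustomerID + ":" + o.OrderID
}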

Pipeline Reuse Pattern

When you need to run the same pipeline multiple times:

// βœ… Correct: Use context to control lifecycle
pipeline := gopipeline.NewStandardPipeline(config, batchFunc)
dataChan := pipeline.DataChan() // Get channel once

// First run
ctx1, cancel1 := context.WithTimeout(context.Background(), time.Second*30)
go pipeline.SyncPerform(ctx1)
// Send data without closing channel
firstRun:
for _, data := range firstBatch {
    select {
    case dataChan <- data:
    case <-ctx1.Done():
        break firstRun // a bare break would only exit the select, not the loop
    }
}
cancel1() // End first run

// Second run - reuse same pipeline and channel
ctx2, cancel2 := context.WithTimeout(context.Background(), time.Second*30)
go pipeline.SyncPerform(ctx2)
// Send data again without closing channel
secondRun:
for _, data := range secondBatch {
    select {
    case dataChan <- data:
    case <-ctx2.Done():
        break secondRun // a bare break would only exit the select, not the loop
    }
}
cancel2() // End second run

// ❌ Wrong: Closing channel prevents reuse
// close(dataChan) // Don't do this if you plan to reuse!

πŸ“Š Error Handling

The framework provides comprehensive error handling mechanisms:

  • ErrContextIsClosed: Context is closed
  • ErrPerformLoopError: Execution loop error
  • ErrChannelIsClosed: Channel is closed

Error Channel Mechanism

v2 version provides a robust error handling mechanism with lazy initialization and non-blocking semantics:

πŸ›‘οΈ Safety Mechanisms

  • First-call Decides Size: ErrorChan(size int) uses sync.Once; the first call decides buffer size, later calls ignore size. If never called explicitly, a default buffer size is used on first internal send.
  • Optional Consumption: Listening to the error channel is optional; if unconsumed and the buffer fills, subsequent errors are dropped to avoid blocking.
  • Non-blocking Send: Errors are sent non-blockingly, ensuring the pipeline isn't blocked.
  • Buffer Full Handling: When the buffer is full, new errors are discarded instead of blocking; no panic occurs.

πŸ“‹ Usage Methods

Method 1: Listen to Errors (Recommended)

// Create error channel and listen
errorChan := pipeline.ErrorChan(10) // Specify buffer size
go func() {
    for {
        select {
        case err, ok := <-errorChan:
            if !ok {
                return // Channel closed
            }
            log.Printf("Processing error: %v", err)
            // Handle according to error type
        case <-ctx.Done():
            return // Context cancelled
        }
    }
}()

Method 2: Run Without Consuming Errors (Simplified)

// You may choose not to consume the error channel.
// The pipeline initializes the error channel on demand and sends errors non-blockingly.
// If the buffer fills and nobody consumes, subsequent errors are dropped (no blocking/panic).
pipeline := gopipeline.NewStandardPipeline(config, flushFunc)
go pipeline.AsyncPerform(ctx)

⚑ Error Handling Performance

  • Near-zero Overhead: Error channel is initialized once on demand; sends are non-blocking and lightweight.
  • Async Processing: Error sending runs independently, minimizing impact on the main flow.
  • Smart Discard: When the buffer is full and unconsumed, subsequent errors are dropped, preventing blocking.

πŸ§ͺ Testing

The project includes complete unit tests and benchmark tests:

# Run all tests
go test ./...

# Run unit tests
go test -v ./... -run Test

# Run benchmark tests
go test -bench=. ./...

# Run standard pipeline benchmark tests
go test -bench=BenchmarkStandardPipeline ./...

# Run deduplication pipeline benchmark tests
go test -bench=BenchmarkDeduplicationPipeline ./...

# Run performance benchmark tests
go test -bench=BenchmarkPipelineDataProcessing ./...

# Run batch efficiency tests
go test -bench=BenchmarkPipelineBatchSizes ./...

# Run memory usage tests
go test -bench=BenchmarkPipelineMemoryUsage ./...

πŸ“ˆ Performance Benchmarks

Latest benchmark test results on Apple M4 processor:

Core Performance Tests

BenchmarkPipelineDataProcessing-10                1000    248.2 ns/op    232 B/op    7 allocs/op
BenchmarkPipelineVsDirectProcessing/Pipeline-10   1000    225.4 ns/op
BenchmarkPipelineVsDirectProcessing/Direct-10     1000    162.7 ns/op
BenchmarkPipelineMemoryUsage-10                   1000    232.2 ns/op    510 B/op    9 allocs/op

Batch Size Efficiency Tests

BenchmarkPipelineBatchSizes/BatchSize1-10         500     740.5 ns/op    500.0 items_processed
BenchmarkPipelineBatchSizes/BatchSize10-10        500     251.5 ns/op    500.0 items_processed
BenchmarkPipelineBatchSizes/BatchSize50-10        500     146.5 ns/op    500.0 items_processed ⭐
BenchmarkPipelineBatchSizes/BatchSize100-10       500     163.4 ns/op    500.0 items_processed
BenchmarkPipelineBatchSizes/BatchSize500-10       500     198.6 ns/op    500.0 items_processed

Performance Analysis

  • Optimal Batch Size: Around 50, 5x performance improvement
  • Pipeline Overhead: About 38%, in exchange for better architecture and maintainability
  • Memory Efficiency: About 232-510 bytes memory usage per data item
  • Processing Capacity: Can process millions of records per second

Deduplication Pipeline Performance Characteristics

Deduplication pipeline adds the following performance characteristics on top of standard pipeline:

  • Memory Usage: Uses map structure to store data, slightly higher memory usage than standard pipeline
  • Processing Latency: Deduplication logic adds about 10-15% processing time
  • Key Generation Overhead: Need to generate unique keys for each data item
  • Batch Efficiency: Batch size after deduplication may be smaller than configured FlushSize

Performance Comparison:

  • Standard Pipeline: ~225 ns/op
  • Deduplication Pipeline: ~260 ns/op (about 15% overhead increase)

❓ Frequently Asked Questions (FAQ)

Q: How to choose appropriate configuration parameters?

A: Configuration recommendations based on performance tests:

  • High Throughput Scenario: FlushSize=50, BufferSize=100, FlushInterval=50ms
  • Low Latency Scenario: FlushSize=10, BufferSize=50, FlushInterval=10ms
  • Memory Constrained Scenario: FlushSize=20, BufferSize=40, FlushInterval=100ms
  • CPU Intensive Processing: Use async mode, appropriately increase buffer size

Q: What are the main differences between v2 and v1?

A: Important improvements in v2:

  1. Removed Add() Method: Changed to DataChan() API, follows "writer closes" principle
  2. Error Channel Improvement: ErrorChan(size int) uses lazy init; the first call decides the buffer size (later calls ignore size). If never called, a default size is used internally on first send.
  3. Performance Optimization: Default configuration optimized based on benchmark tests
  4. Better Lifecycle Management: Users control data channel closing timing

Q: Why remove the Add() method?

A:

  • Violates Go Principles: Add() method violates Go's "writer closes" principle
  • Better Control: DataChan() gives users complete control over data sending and channel closing
  • More Conventional: This is the standard Go channel usage pattern

Q: How to migrate from v1 to v2?

A: Migration steps:

// v1 approach
pipeline.Add(ctx, data)

// v2 approach
dataChan := pipeline.DataChan()
go func() {
    defer close(dataChan)
    for _, data := range dataList {
        select {
        case dataChan <- data:
        case <-ctx.Done():
            return
        }
    }
}()

Q: How to handle panic in batch processing functions?

A: The framework internally handles panic, but it's recommended to add recover in batch processing functions:

func(ctx context.Context, batchData []Task) (err error) {
    defer func() {
        if r := recover(); r != nil {
            log.Printf("Batch processing panic: %v", r)
            err = fmt.Errorf("batch processing panic: %v", r) // surface the panic as an error
        }
    }()
    
    // Processing logic
    return nil
}

πŸ”§ Troubleshooting

Memory Leaks

Symptoms: Memory usage continuously growing

Causes:

  • Error channel not being consumed
  • Data channel not properly closed
  • Memory leaks in batch processing functions

Solutions:

// Ensure error channel is consumed
errorChan := pipeline.ErrorChan(10)
go func() {
    for {
        select {
        case err, ok := <-errorChan:
            if !ok {
                return
            }
            // Handle error
        case <-ctx.Done():
            return
        }
    }
}()

// Ensure data channel is closed
dataChan := pipeline.DataChan()
defer close(dataChan)

Performance Issues

Symptoms: Processing speed slower than expected

Troubleshooting Steps:

  1. Check if batch size is around 50
  2. Ensure BufferSize >= FlushSize * 2
  3. Use async mode
  4. Check batch processing function execution time

Optimization Recommendations:

// Use performance-optimized configuration
config := gopipeline.PipelineConfig{
    BufferSize:    100,                   // >= FlushSize * 2
    FlushSize:     50,                    // Optimal batch size
    FlushInterval: time.Millisecond * 50, // Balance latency and throughput
}

Data Loss

Symptoms: Some data not being processed

Causes:

  • Context cancelled too early
  • Data channel closed too early
  • Batch processing function returns error but not handled

Solutions:

// Use sufficient timeout
ctx, cancel := context.WithTimeout(context.Background(), time.Minute*5)
defer cancel()

// Ensure all data is sent before closing channel
dataChan := pipeline.DataChan()
go func() {
    defer close(dataChan) // Close after all data is sent
    for _, data := range allData {
        select {
        case dataChan <- data:
        case <-ctx.Done():
            return
        }
    }
}()

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

🀝 Contributing

Issues and Pull Requests to improve this project are welcome!
