English | δΈζ | π Documentation
A high-performance Go batch processing pipeline framework with generics support and concurrency safety, providing both standard and deduplication batch processing modes.
- Go 1.18+ (with generics support)
- Supports Linux, macOS, Windows
go get github.com/rushairer/go-pipeline/v2@latest
- Generics Support: Type-safe implementation based on Go 1.18+ generics
- Batch Processing: Automatic batching by size and time intervals
- Concurrency Safety: Built-in goroutine safety mechanisms
- Flexible Configuration: Customizable buffer size, batch size, and flush intervals
- Error Handling: Comprehensive error handling and propagation mechanisms
- Two Modes: Standard batch processing and deduplication batch processing
- Sync/Async: Support for both synchronous and asynchronous execution modes
- Go Conventions: Follows "writer closes" channel management principle
v2/
βββ config.go # Configuration definitions
βββ errors.go # Error definitions
βββ interface.go # Interface definitions
βββ pipeline_impl.go # Common pipeline implementation
βββ pipeline_standard.go # Standard pipeline implementation
βββ pipeline_deduplication.go # Deduplication pipeline implementation
βββ pipeline_standard_test.go # Standard pipeline unit tests
βββ pipeline_standard_benchmark_test.go # Standard pipeline benchmark tests
βββ pipeline_deduplication_test.go # Deduplication pipeline unit tests
βββ pipeline_deduplication_benchmark_test.go # Deduplication pipeline benchmark tests
βββ pipeline_performance_benchmark_test.go # Performance benchmark tests
- PipelineChannel[T]: Defines the pipeline channel access interface
- Performer: Defines the pipeline execution interface
- DataProcessor[T]: Defines the core batch data processing interface
- Pipeline[T]: Combines all pipeline functionality into a universal interface
- StandardPipeline[T]: Standard batch processing pipeline; processes data sequentially in batches
- DeduplicationPipeline[T]: Deduplication batch processing pipeline; deduplicates based on unique keys
- PipelineImpl[T]: Common pipeline implementation providing basic functionality
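For orientation, here is a rough sketch of how those interfaces might look, inferred only from the methods used throughout this README (DataChan, ErrorChan, SyncPerform, AsyncPerform); consult interface.go for the authoritative definitions:
// Illustrative sketch only; actual signatures live in interface.go.
type PipelineChannel[T any] interface {
	DataChan() chan T                // data input channel (the writer closes it)
	ErrorChan(size int) <-chan error // lazily initialized error channel
}

type Performer interface {
	SyncPerform(ctx context.Context) error
	AsyncPerform(ctx context.Context) error
}

// DataProcessor[T] carries the core batch hooks (adding items, checking
// fullness, flushing); its exact methods are omitted here.

type Pipeline[T any] interface {
	PipelineChannel[T]
	Performer
}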
βββββββββββββββββββ     βββββββββββββββββββ     βββββββββββββββββββ
β   Data Input    βββββΆβ Buffer Channel  βββββΆβ Batch Processor β
βββββββββββββββββββ     βββββββββββββββββββ     βββββββββββββββββββ
                                 β                       β
                                 βΌ                       βΌ
                        βββββββββββββββββββ     βββββββββββββββββββ
                        β  Timer Ticker   β     β  Flush Handler  β
                        βββββββββββββββββββ     βββββββββββββββββββ
                                 β                       β
                                 βββββββββββββ¬ββββββββββββ
                                             βΌ
                                    βββββββββββββββββββ
                                    β  Error Channel  β
                                    βββββββββββββββββββ
graph TD
A[Data Input] --> B[Add to Buffer Channel]
B --> C{Batch Full?}
C -->|Yes| D[Execute Batch Processing]
C -->|No| E[Wait for More Data]
E --> F{Timer Triggered?}
F -->|Yes| G{Batch Empty?}
G -->|No| D
G -->|Yes| E
F -->|No| E
D --> H[Call Flush Function]
H --> I{Error Occurred?}
I -->|Yes| J[Send to Error Channel]
I -->|No| K[Reset Batch]
J --> K
K --> E
The project includes a complete test suite to ensure code quality and performance:
- pipeline_standard_test.go: Unit tests for the standard pipeline, verifying basic functionality
- pipeline_deduplication_test.go: Unit tests for the deduplication pipeline, verifying deduplication logic
- pipeline_standard_benchmark_test.go: Performance benchmarks for the standard pipeline
- pipeline_deduplication_benchmark_test.go: Performance benchmarks for the deduplication pipeline
- pipeline_performance_benchmark_test.go: Comprehensive performance benchmarks
graph TD
A[Data Input] --> B[Get Unique Key]
B --> C[Add to Map Container]
C --> D{Batch Full?}
D -->|Yes| E[Execute Deduplication Batch Processing]
D -->|No| F[Wait for More Data]
F --> G{Timer Triggered?}
G -->|Yes| H{Batch Empty?}
H -->|No| E
H -->|Yes| F
G -->|No| F
E --> I[Call Deduplication Flush Function]
I --> J{Error Occurred?}
J -->|Yes| K[Send to Error Channel]
J -->|No| L[Reset Batch]
K --> L
L --> F
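The "Get Unique Key" step relies on each item exposing a key. In the deduplication quick start further below, the User type does this by implementing GetKey() string; the contract presumably looks like:
// Presumed shape of the unique-key contract (the User example below implements it).
type UniqueKeyData interface {
	GetKey() string
}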
type PipelineConfig struct {
BufferSize uint32 // Buffer channel capacity (default: 100)
FlushSize uint32 // Maximum batch data capacity (default: 50)
FlushInterval time.Duration // Timed flush interval (default: 50ms)
DrainOnCancel bool // Whether to best-effort flush on cancellation (default false)
DrainGracePeriod time.Duration // Max window for the final flush when DrainOnCancel is true
}
Based on performance benchmarks, v2 ships with an optimized default configuration:
- BufferSize: 100 - Buffer size, should be >= FlushSize * 2 to avoid blocking
- FlushSize: 50 - Batch size, performance tests show around 50 is optimal
- FlushInterval: 50ms - Flush interval, balances latency and throughput
- Roles:
- FlushSize: batch size threshold; reaching it triggers a flush (or by FlushInterval).
- BufferSize: capacity of input channel; determines how much can be queued without blocking producers.
- Recommended relation: BufferSize β₯ k Γ FlushSize, where k β [4, 10] for stable throughput under bursts (see the example after this list).
- Effects of size relation:
- BufferSize < FlushSize: frequent timeout-based small batches, lower throughput, higher latency/GC.
- BufferSize β 2ΓFlushSize: generally OK, but sensitive to bursty producers.
- BufferSize β₯ 4β10ΓFlushSize: higher full-batch ratio, better throughput, fewer producer stalls (uses more memory).
- Coordination with FlushInterval:
- FlushInterval bounds tail latency when a batch isn't filled in time.
- Too-small BufferSize shifts more flushes to timeout path, shrinking effective batch size.
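As a concrete instance of the guideline above (k = 4), using the PipelineConfig struct shown earlier:
// k = 4 gives headroom for bursty producers; drop toward 2 if memory is tight.
config := gopipeline.PipelineConfig{
	FlushSize:     50,                    // batch threshold
	BufferSize:    200,                   // k Γ FlushSize = 4 Γ 50
	FlushInterval: 50 * time.Millisecond, // bounds tail latency for partial batches
}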
Sizing recipe based on processing cost:
- Measure in your flush function:
- t_item: average per-item processing time (ns/item).
- t_batch: fixed per-batch overhead (ns/batch), e.g., DB round-trip, encoding, etc.
- Choose amortization target Ξ± (e.g., 0.1 means per-batch overhead per item β€ 10% of per-item cost).
- Then:
- FlushSize β₯ ceil(t_batch / (Ξ± Γ t_item)) // clamp to [32, 128] as a practical range; default 50
- BufferSize = k Γ FlushSize, with k in [4, 10] depending on burstiness and number of producers.
- Example:
- t_item = 2Β΅s, t_batch = 200Β΅s, Ξ± = 0.1 β FlushSize β₯ 200 / (0.1Γ2) = 1000 β clamp to 128 if latency-sensitive, or keep 1000 if purely throughput-focused; then BufferSize = 4β10 Γ FlushSize.
Recommended defaults (balanced latency/throughput):
- FlushSize: 50
- BufferSize: 100 (β 2ΓFlushSize; increase to 4β10Γ under multi-producer bursts)
- FlushInterval: 50ms
Quick picks:
- High throughput:
- FlushSize: 64β128 (default 50 is balanced; increase if pure throughput)
- BufferSize: 4β10 Γ FlushSize (higher with more producers/bursts)
- FlushInterval: 50β100ms
- Low latency:
- FlushSize: 8β32
- BufferSize: β₯ 4 Γ FlushSize
- FlushInterval: 1β10ms (cap tail latency)
- Memory constrained:
- FlushSize: 16β32
- BufferSize: 2β4 Γ FlushSize (bounded by memory budget)
- FlushInterval: 50β200ms
- Multiple producers (N):
- Suggest: BufferSize β₯ (4β10) Γ FlushSize Γ ceil(N / NumCPU)
- Goal: maintain high full-batch ratio and reduce producer stalls under bursts
Formulas:
- FlushSize β clamp( ceil(t_batch / (Ξ± Γ t_item)), 32, 128 )
- t_item: avg per-item cost
- t_batch: fixed per-batch overhead (e.g., DB round-trip)
- Ξ±: amortization target (e.g., 0.1 β per-batch overhead per item β€ 10% of t_item)
- BufferSize = k Γ FlushSize, k β [4, 10] based on concurrency/burstiness
- FlushInterval:
- latency-bound: FlushInterval β target tail-latency budget
- rate-bound: FlushInterval β p99 inter-arrival Γ FlushSize
Validation checklist:
- Full-batch ratio β₯ 80% (see the tracking sketch after this checklist)
- Producer stall rate near 0
- Stable GC/memory watermark
- P99/P99.9 E2E latency within SLO
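One way to observe the full-batch ratio is to count batches inside the flush function, mirroring the atomic-counter pattern used in the monitoring example later in this README (the counters and the literal 50 are illustrative):
var totalBatches, fullBatches int64 // requires "sync/atomic"

flushFn := func(ctx context.Context, batch []int) error {
	atomic.AddInt64(&totalBatches, 1)
	if len(batch) >= 50 { // compare against your configured FlushSize
		atomic.AddInt64(&fullBatches, 1)
	}
	// ... actual batch processing ...
	return nil
}
// full-batch ratio = fullBatches / totalBatches; target β₯ 80%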
Use two measurements (N1, N2) to estimate t_item and t_batch, then compute recommended FlushSize/BufferSize:
package main
import (
"context"
"fmt"
"math"
"time"
gopipeline "github.com/rushairer/go-pipeline/v2"
)
// Replace this with the real batch function you plan to use.
func batchFunc(ctx context.Context, items []int) error {
// Simulate work: per-batch overhead + per-item cost.
// Replace with actual logic. Keep it side-effect free for measurement.
time.Sleep(200 * time.Microsecond) // t_batch (example)
perItem := 2 * time.Microsecond // t_item (example)
time.Sleep(time.Duration(len(items)) * perItem)
return nil
}
func measureOnce(n int, rounds int) time.Duration {
items := make([]int, n)
var total time.Duration
ctx := context.Background()
for i := 0; i < rounds; i++ {
start := time.Now()
_ = batchFunc(ctx, items)
total += time.Since(start)
}
return total / time.Duration(rounds)
}
func estimateCosts(n1, n2, rounds int) (tItem, tBatch time.Duration) {
d1 := measureOnce(n1, rounds)
d2 := measureOnce(n2, rounds)
// Linear fit: d = t_batch + n * t_item
// t_item = (d2 - d1) / (n2 - n1)
// t_batch = d1 - n1 * t_item
tItem = time.Duration(int64((d2 - d1) / time.Duration(n2-n1)))
tBatch = d1 - time.Duration(n1)*tItem
if tItem < 0 {
tItem = 0
}
if tBatch < 0 {
tBatch = 0
}
return
}
func recommend(tItem, tBatch time.Duration, alpha float64, k int) (flush uint32, buffer uint32) {
// FlushSize >= ceil(t_batch / (alpha * t_item)), clamped to [32, 128] as a practical default range
if tItem <= 0 || alpha <= 0 {
return 50, 100 // safe defaults
}
raw := float64(tBatch) / (alpha * float64(tItem))
fs := int(math.Ceil(raw))
if fs < 32 {
fs = 32
}
if fs > 128 {
// If you're purely throughput-focused, you may keep fs > 128.
// For balanced latency, clamp to 128.
fs = 128
}
if k < 1 {
k = 4
}
return uint32(fs), uint32(k * fs)
}
func main() {
// Example measurement with two points
n1, n2 := 64, 512
rounds := 20
tItem, tBatch := estimateCosts(n1, n2, rounds)
flush, buffer := recommend(tItem, tBatch, 0.1, 8)
fmt.Printf("Estimated t_item=%v, t_batch=%v\n", tItem, tBatch)
fmt.Printf("Recommended FlushSize=%d, BufferSize=%d (k=8, Ξ±=0.1)\n", flush, buffer)
// Example of using recommended config
_ = gopipeline.NewStandardPipeline[int](gopipeline.PipelineConfig{
BufferSize: buffer,
FlushSize: flush,
FlushInterval: 50 * time.Millisecond,
}, func(ctx context.Context, batch []int) error {
return batchFunc(ctx, batch)
})
}
Notes:
- Replace batchFunc with your real processing. Keep external side effects minimal to reduce noise while measuring.
- If your measured FlushSize exceeds 128 and you care about latency, clamp to 128; otherwise keep the larger value and increase BufferSize proportionally (k Γ FlushSize).
- Re-run measurements on different machines/workloads; cache, IO and network drastically affect t_batch.
Key points:
- Effective batch size: After dedup, the actual batch size β€ FlushSize. If input has high duplication, the effective batch size may be much smaller than FlushSize.
- Apply the same sizing recipe, but consider uniqueness ratio u β (0,1]:
- If pre-dedup batch has N items and u fraction are unique, effective items β u Γ N.
- When computing FlushSize by cost, you may need larger pre-dedup FlushSize so that u Γ FlushSize β your target effective batch (e.g., ~50).
- Buffer and interval:
- BufferSize: still set BufferSize β₯ k Γ FlushSize with k in [4,10] to absorb bursts.
- FlushInterval: with high duplication, slightly increasing FlushInterval can help accumulate enough unique items to reach target effective batch; balance with latency SLO.
- Memory note:
- Dedup uses a map for the current batch. Map entries add overhead per unique key; prefer reusing value buffers in your flush function to reduce allocations.
Example with duplication:
- Suppose t_item = 2Β΅s, t_batch = 200Β΅s, Ξ± = 0.1 β cost-based FlushSize_raw = 1000.
- If uniqueness ratio u β 0.2, effective batch at FlushSize_raw β 200. If you want β 50 effective:
- You can clamp FlushSize to 256β512 for latency balance, since u Γ 256 β 51, or keep larger for throughput.
- Set BufferSize = 8 Γ FlushSize to handle bursts.
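That adjustment can be captured in a tiny helper (hypothetical, not part of the library): divide the target effective batch by the uniqueness ratio to get the pre-dedup FlushSize.
// dedupFlushSize returns the pre-dedup FlushSize needed so that, at uniqueness
// ratio u, a flush still contains roughly targetEffective unique items.
// Requires "math".
func dedupFlushSize(targetEffective int, u float64) uint32 {
	if u <= 0 || u > 1 {
		return uint32(targetEffective)
	}
	return uint32(math.Ceil(float64(targetEffective) / u))
}

// dedupFlushSize(50, 0.2) == 250: ~250 pre-dedup items yield ~50 unique per flush.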
For N producers on P logical CPUs:
- Rule of thumb: BufferSize β₯ (4β10) Γ FlushSize Γ ceil(N / P).
- Purpose: keep the consumer flushing with full batches while minimizing producer stalls during bursts.
Numerical example:
- P=8 CPUs, N=16 producers, target FlushSize=64, choose k=6:
- BufferSize β₯ 6 Γ 64 Γ ceil(16/8) = 6 Γ 64 Γ 2 = 768 (round to 1024 for headroom).
- If uniqueness ratio u=0.5 in dedup mode and you need ~64 effective per flush, set FlushSizeβ128, then recompute BufferSize.
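The same rule of thumb as a small helper (hypothetical, not part of the library):
// recommendBufferSize applies BufferSize β₯ k Γ FlushSize Γ ceil(N / P).
func recommendBufferSize(flushSize uint32, k, producers, cpus int) uint32 {
	waves := (producers + cpus - 1) / cpus // ceil(N / P)
	return flushSize * uint32(k*waves)
}

// recommendBufferSize(64, 6, 16, 8) == 768; round up (e.g. to 1024) for headroom.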
Two optional knobs to balance correctness vs. immediacy on cancellation:
- DrainOnCancel (bool, default: false):
- false: cancel means immediate stop (no final flush)
- true: on cancel, perform a best-effort final flush for the current partial batch within a bounded window
- DrainGracePeriod (time.Duration):
- Max time window for the best-effort flush when DrainOnCancel is true (default internal fallback: ~100ms if unset)
Recommended usage:
- Normal shutdown (preserve data): close the data channel; the pipeline guarantees a final flush of remaining data and exits.
- Forceful stop: cancel the context with DrainOnCancel=false.
- Graceful cancel with minimal loss: set DrainOnCancel=true and configure a reasonable DrainGracePeriod (e.g., 50β200ms), noting the flush function should not ignore the new context.
You can use the NewPipelineConfig() function to create a configuration with default values, then customize specific parameters:
// Create configuration with default values
config := gopipeline.NewPipelineConfig()
// Use default values directly
pipeline := gopipeline.NewStandardPipeline(config, flushFunc)
// Or customize specific parameters using chain methods
config = gopipeline.NewPipelineConfig().
WithFlushInterval(time.Millisecond * 10).
WithBufferSize(200)
pipeline = gopipeline.NewStandardPipeline(config, flushFunc)
Available configuration methods:
- NewPipelineConfig() - Create a configuration with default values
- WithBufferSize(size uint32) - Set the buffer size
- WithFlushSize(size uint32) - Set the batch size
- WithFlushInterval(interval time.Duration) - Set the flush interval
- WithDrainOnCancel(enabled bool) - Enable a best-effort final flush on cancel
- WithDrainGracePeriod(d time.Duration) - Set the maximum window for the final flush when DrainOnCancel is enabled
package main
import (
"context"
"fmt"
"log"
"time"
gopipeline "github.com/rushairer/go-pipeline/v2"
)
func main() {
// Create standard pipeline
pipeline := gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, batchData []int) error {
fmt.Printf("Processing batch data: %v\n", batchData)
// Here you can perform database writes, API calls, etc.
return nil
},
)
ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
defer cancel()
// Start async processing
go func() {
if err := pipeline.AsyncPerform(ctx); err != nil {
log.Printf("Pipeline execution error: %v", err)
}
}()
// Listen for errors (consuming the error channel is recommended)
errorChan := pipeline.ErrorChan(10) // Specify error channel buffer size
go func() {
for {
select {
case err, ok := <-errorChan:
if !ok {
return
}
log.Printf("Batch processing error: %v", err)
case <-ctx.Done():
return
}
}
}()
// Use new DataChan API to send data
dataChan := pipeline.DataChan()
go func() {
defer close(dataChan) // User controls channel closure
for i := 0; i < 100; i++ {
select {
case dataChan <- i:
case <-ctx.Done():
return
}
}
}()
time.Sleep(time.Second * 2) // Wait for processing to complete
}
package main
import (
"context"
"fmt"
"log"
"time"
gopipeline "github.com/rushairer/go-pipeline/v2"
)
// Data structure implementing UniqueKeyData interface
type User struct {
ID string
Name string
}
func (u User) GetKey() string {
return u.ID
}
func main() {
// Create deduplication pipeline
pipeline := gopipeline.NewDefaultDeduplicationPipeline(
func(ctx context.Context, batchData map[string]User) error {
fmt.Printf("Processing deduplicated user data: %d users\n", len(batchData))
for key, user := range batchData {
fmt.Printf(" %s: %s\n", key, user.Name)
}
return nil
},
)
ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
defer cancel()
// Start async processing
go func() {
if err := pipeline.AsyncPerform(ctx); err != nil {
log.Printf("Pipeline execution error: %v", err)
}
}()
// Listen for errors
errorChan := pipeline.ErrorChan(10)
go func() {
for {
select {
case err, ok := <-errorChan:
if !ok {
return
}
log.Printf("Batch processing error: %v", err)
case <-ctx.Done():
return
}
}
}()
// Use new DataChan API to send data
dataChan := pipeline.DataChan()
go func() {
defer close(dataChan)
users := []User{
{ID: "1", Name: "Alice"},
{ID: "2", Name: "Bob"},
{ID: "1", Name: "Alice Updated"}, // Will overwrite the first Alice
{ID: "3", Name: "Charlie"},
{ID: "2", Name: "Bob Updated"}, // Will overwrite the first Bob
}
for _, user := range users {
select {
case dataChan <- user:
case <-ctx.Done():
return
}
}
}()
time.Sleep(time.Second * 2) // Wait for processing to complete
}
// Create pipeline with custom configuration
config := gopipeline.PipelineConfig{
BufferSize: 100, // Recommended balanced default
FlushSize: 50, // Recommended balanced default
FlushInterval: 50 * time.Millisecond, // Recommended balanced default
}
pipeline := gopipeline.NewStandardPipeline(config,
func(ctx context.Context, batchData []string) error {
// Custom processing logic
return nil
},
)
Two ways to finish a pipeline run:
- Close the data channel (recommended lossless drain)
config := gopipeline.NewPipelineConfig().
WithBufferSize(100).
WithFlushSize(50).
WithFlushInterval(50 * time.Millisecond)
// DrainOnCancel is irrelevant here; closing the channel guarantees final flush.
p := gopipeline.NewStandardPipeline(config, func(ctx context.Context, batch []string) error {
// Your processing
return nil
})
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
go func() { _ = p.AsyncPerform(ctx) }()
dataChan := p.DataChan()
go func() {
defer close(dataChan) // writer closes: guarantees a final flush of remaining items
for i := 0; i < 1000; i++ {
select {
case dataChan <- fmt.Sprintf("item-%d", i):
case <-ctx.Done():
return
}
}
}()
- Cancel via context, with best-effort drain on cancel
config := gopipeline.NewPipelineConfig().
WithBufferSize(100).
WithFlushSize(50).
WithFlushInterval(50 * time.Millisecond).
WithDrainOnCancel(true). // enable drain on cancel
WithDrainGracePeriod(150 * time.Millisecond) // bound the drain time window
p := gopipeline.NewStandardPipeline(config, func(ctx context.Context, batch []string) error {
// IMPORTANT: respect ctx; return promptly when ctx.Done() to honor grace window
return nil
})
ctx, cancel := context.WithCancel(context.Background())
go func() { _ = p.AsyncPerform(ctx) }()
dataChan := p.DataChan()
// send some data...
// When you need to stop quickly but still try to flush current partial batch:
cancel() // pipeline will do one best-effort flush within DrainGracePeriod, then exit
Notes:
- Close-the-channel path ensures remaining data is flushed regardless of ctx cancellation.
- Drain-on-cancel is a compromise for fast stop with minimal loss; choose a small DrainGracePeriod (e.g., 50β200ms) and ensure your flush respects the provided context.
Two ways to finish a deduplication pipeline run:
- Close the data channel (recommended lossless drain)
config := gopipeline.NewPipelineConfig().
WithBufferSize(100).
WithFlushSize(50).
WithFlushInterval(50 * time.Millisecond)
p := gopipeline.NewDefaultDeduplicationPipeline(func(ctx context.Context, batch map[string]User) error {
// Your deduped processing
return nil
})
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
go func() { _ = p.AsyncPerform(ctx) }()
ch := p.DataChan()
go func() {
defer close(ch) // writer closes: guarantees a final flush (dedup map is flushed)
for i := 0; i < 1000; i++ {
select {
case ch <- User{ID: fmt.Sprintf("%d", i%200), Name: "N"}: // include duplicates
case <-ctx.Done():
return
}
}
}()
- Cancel via context, with best-effort drain on cancel
config := gopipeline.NewPipelineConfig().
WithBufferSize(100).
WithFlushSize(50).
WithFlushInterval(50 * time.Millisecond).
WithDrainOnCancel(true).
WithDrainGracePeriod(150 * time.Millisecond)
p := gopipeline.NewDefaultDeduplicationPipeline(func(ctx context.Context, batch map[string]User) error {
// IMPORTANT: respect ctx; return promptly to honor the grace window
return nil
})
ctx, cancel := context.WithCancel(context.Background())
go func() { _ = p.AsyncPerform(ctx) }()
ch := p.DataChan()
// send some data...
cancel() // pipeline performs a best-effort flush of current dedup map within DrainGracePeriod, then exits
Notes:
- Dedup mode keeps a map for the current batch; both shutdown strategies ensure the remaining unique entries are flushed.
- For high-duplication inputs, consider a slightly longer FlushInterval to accumulate enough unique items, balanced with your latency SLO.
The pipeline can exit via two distinct paths:
- Channel closed:
- If the current batch is non-empty, a final synchronous flush is performed with context.Background().
- The loop returns nil (graceful shutdown).
- Context canceled:
- DrainOnCancel = false: return ErrContextIsClosed (no final flush).
- DrainOnCancel = true: perform one best-effort final synchronous flush under a separate drainCtx with timeout (DrainGracePeriod, default ~100ms if unset). Returns errors.Join(ErrContextIsClosed, ErrContextDrained).
Detect exit conditions via errors.Is:
err := pipeline.AsyncPerform(ctx)
// ...
if errors.Is(err, gopipeline.ErrContextIsClosed) {
// Exited due to context cancellation
}
if errors.Is(err, gopipeline.ErrContextDrained) {
// A best-effort final drain flush was performed on cancel
}
// On channel-close path, err == nil (graceful shutdown)
Notes:
- The final drain flush is executed synchronously to avoid races on shutdown.
- Your flush function should respect the provided context (drainCtx) and return promptly.
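A sketch of a flush function that honors the drain window; Record and store.Write are stand-ins for your own types and I/O:
func flush(ctx context.Context, batch []Record) error {
	for _, record := range batch {
		select {
		case <-ctx.Done():
			// Return promptly so the DrainGracePeriod (drainCtx) is honored.
			return ctx.Err()
		default:
		}
		if err := store.Write(ctx, record); err != nil {
			return err
		}
	}
	return nil
}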
// Batch insert database records
pipeline := gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, records []DatabaseRecord) error {
return db.BatchInsert(ctx, records)
},
)
// Batch write log files
pipeline := gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, logs []LogEntry) error {
return logWriter.WriteBatch(logs)
},
)
// Batch call third-party APIs
pipeline := gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, requests []APIRequest) error {
return apiClient.BatchCall(ctx, requests)
},
)
// User data deduplication processing
pipeline := gopipeline.NewDefaultDeduplicationPipeline(
func(ctx context.Context, users map[string]User) error {
return userService.BatchUpdate(ctx, users)
},
)
// Batch process message queue data
pipeline := gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, messages []Message) error {
return messageProcessor.ProcessBatch(ctx, messages)
},
)
// Dynamically adjust configuration based on system load
func createAdaptivePipeline() *gopipeline.StandardPipeline[Task] {
config := gopipeline.PipelineConfig{
BufferSize: getOptimalBufferSize(),
FlushSize: getOptimalFlushSize(),
FlushInterval: getOptimalInterval(),
}
return gopipeline.NewStandardPipeline(config, processTaskBatch)
}
func getOptimalBufferSize() uint32 {
// Calculate based on system memory and CPU cores
return uint32(runtime.NumCPU() * 50)
}
func getOptimalFlushSize() uint32 {
// Based on performance tests, around 50 is optimal
return 50
}
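// getOptimalInterval is referenced above but was not defined in the original
// snippet; a minimal completion consistent with the benchmark findings:
func getOptimalInterval() time.Duration {
	// 50ms balances latency and throughput
	return 50 * time.Millisecond
}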
pipeline := gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, batchData []Task) error {
return retryWithBackoff(ctx, func() error {
return processBatch(batchData)
}, 3, time.Second)
},
)
func retryWithBackoff(ctx context.Context, fn func() error, maxRetries int, baseDelay time.Duration) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(baseDelay * time.Duration(1<<i)):
			// Exponential backoff
		}
	}
	return fmt.Errorf("max retries exceeded: %w", lastErr)
}
type MetricsPipeline struct {
*gopipeline.StandardPipeline[Event]
processedCount int64
errorCount int64
}
func NewMetricsPipeline() *MetricsPipeline {
mp := &MetricsPipeline{}
mp.StandardPipeline = gopipeline.NewDefaultStandardPipeline(
func(ctx context.Context, events []Event) error {
err := processEvents(events)
atomic.AddInt64(&mp.processedCount, int64(len(events)))
if err != nil {
atomic.AddInt64(&mp.errorCount, 1)
}
return err
},
)
return mp
}
func (mp *MetricsPipeline) GetMetrics() (processed, errors int64) {
return atomic.LoadInt64(&mp.processedCount), atomic.LoadInt64(&mp.errorCount)
}
func gracefulShutdown(pipeline *gopipeline.StandardPipeline[Task]) {
// Create context with timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Stop accepting new data
// Close data channel
dataChan := pipeline.DataChan()
close(dataChan)
// Wait for processing to complete
done := make(chan struct{})
go func() {
defer close(done)
// Wait for error channel to close, indicating all data has been processed
errorChan := pipeline.ErrorChan(10)
for {
select {
case err, ok := <-errorChan:
if !ok {
return
}
log.Printf("Processing remaining error: %v", err)
case <-ctx.Done():
return
}
}
}()
// Wait for completion or timeout
select {
case <-done:
log.Println("Pipeline graceful shutdown completed")
case <-ctx.Done():
log.Println("Pipeline shutdown timeout")
}
}
Based on the latest performance benchmark test results:
- Data Processing Throughput: ~248 nanoseconds/item (Apple M4)
- Memory Efficiency: 232 bytes/operation, 7 allocations/operation
- Batch Processing Optimization: 5x performance improvement from batch size 1 to 50
- Pipeline Overhead: About 38% slower than direct processing (225.4 vs 162.7 ns/op)
BatchSize1: 740.5 ns/op (Slowest - frequent flushing)
BatchSize10: 251.5 ns/op (Significant improvement)
BatchSize50: 146.5 ns/op (Optimal performance) β
BatchSize100: 163.4 ns/op (Slight decline)
BatchSize500: 198.6 ns/op (Batch too large)
- Optimal Batch Size: Around 50
- Buffer Configuration: BufferSize >= FlushSize * 2
- Flush Interval: 50ms balances latency and throughput
- Async Mode: Recommended for better performance
Error Channel Behavior: The error channel is lazily initialized via sync.Once. The first call to ErrorChan(size int) decides the buffer size; subsequent calls ignore the size argument. Even if you never call it explicitly, the pipeline initializes the channel on the first error send and writes errors non-blockingly. If the channel isn't consumed and the buffer fills, subsequent errors are dropped (no blocking or panic).
Recommended to Listen to the Error Channel: If you call ErrorChan(size int), listen to the returned channel and use select statements to avoid waiting indefinitely.
Channel Management: v2 follows the "writer closes" principle; users control when to close DataChan().
β οΈ Pipeline Reuse Warning: If you need to reuse the same pipeline instance for multiple runs (calling SyncPerform() or AsyncPerform() multiple times), do NOT close the DataChan prematurely. DataChan() returns the same channel instance, and once closed, it cannot be reused. Use context cancellation or timeouts to control the pipeline lifecycle instead.
- Reasonable Batch Size: Based on performance tests, a batch size around 50 is recommended
- β οΈ Listen to the Error Channel: Use select statements to avoid blocking; handle errors from batch processing promptly
- Proper Channel Closure: Use defer close(dataChan) to ensure the channel is closed properly
- Context Management: Use context to control the pipeline lifecycle
- Deduplication Key Design: Ensure uniqueness and stability of deduplication keys
- Performance Tuning: Choose appropriate configuration parameters based on benchmark results
- β οΈ Pipeline Reuse: For repeated pipeline usage, avoid closing DataChan prematurely; use context timeout/cancellation instead of channel closure to end processing
When you need to run the same pipeline multiple times:
// β Correct: Use context to control lifecycle
pipeline := gopipeline.NewStandardPipeline(config, batchFunc)
dataChan := pipeline.DataChan() // Get channel once
// First run
ctx1, cancel1 := context.WithTimeout(context.Background(), time.Second*30)
go pipeline.SyncPerform(ctx1)
// Send data without closing channel
sendFirst:
	for _, data := range firstBatch {
		select {
		case dataChan <- data:
		case <-ctx1.Done():
			break sendFirst // a bare break would only exit the select, not the loop
		}
	}
cancel1() // End first run
// Second run - reuse same pipeline and channel
ctx2, cancel2 := context.WithTimeout(context.Background(), time.Second*30)
go pipeline.SyncPerform(ctx2)
// Send data again without closing channel
sendSecond:
	for _, data := range secondBatch {
		select {
		case dataChan <- data:
		case <-ctx2.Done():
			break sendSecond // labeled break exits the loop, not just the select
		}
	}
cancel2() // End second run
// β Wrong: Closing channel prevents reuse
// close(dataChan) // Don't do this if you plan to reuse!
The framework provides comprehensive error handling mechanisms:
- ErrContextIsClosed: Context is closed
- ErrPerformLoopError: Execution loop error
- ErrChannelIsClosed: Channel is closed
v2 version provides a robust error handling mechanism with lazy initialization and non-blocking semantics:
- First Call Decides Size: ErrorChan(size int) uses sync.Once; the first call decides the buffer size, later calls ignore it. If never called explicitly, a default buffer size is used on the first internal send.
- Optional Consumption: Listening to the error channel is optional; if unconsumed and the buffer fills, subsequent errors are dropped to avoid blocking.
- Non-blocking Send: Errors are sent non-blockingly, so the pipeline is never blocked by error reporting.
- Buffer Full Handling: When the buffer is full, new errors are discarded instead of blocking; no panic occurs.
Method 1: Listen to Errors (Recommended)
// Create error channel and listen
errorChan := pipeline.ErrorChan(10) // Specify buffer size
go func() {
for {
select {
case err, ok := <-errorChan:
if !ok {
return // Channel closed
}
log.Printf("Processing error: %v", err)
// Handle according to error type
case <-ctx.Done():
return // Context cancelled
}
}
}()
Method 2: Run Without Consuming Errors (Simplified)
// You may choose not to consume the error channel.
// The pipeline initializes the error channel on demand and sends errors non-blockingly.
// If the buffer fills and nobody consumes, subsequent errors are dropped (no blocking/panic).
pipeline := gopipeline.NewStandardPipeline(config, flushFunc)
go pipeline.AsyncPerform(ctx)
- Near-zero Overhead: Error channel is initialized once on demand; sends are non-blocking and lightweight.
- Async Processing: Error sending runs independently, minimizing impact on the main flow.
- Smart Discard: When the buffer is full and unconsumed, subsequent errors are dropped, preventing blocking.
The project includes complete unit tests and benchmark tests:
# Run all tests
go test ./...
# Run unit tests
go test -v ./... -run Test
# Run benchmark tests
go test -bench=. ./...
# Run standard pipeline benchmark tests
go test -bench=BenchmarkStandardPipeline ./...
# Run deduplication pipeline benchmark tests
go test -bench=BenchmarkDeduplicationPipeline ./...
# Run performance benchmark tests
go test -bench=BenchmarkPipelineDataProcessing ./...
# Run batch efficiency tests
go test -bench=BenchmarkPipelineBatchSizes ./...
# Run memory usage tests
go test -bench=BenchmarkPipelineMemoryUsage ./...
Latest benchmark test results on Apple M4 processor:
BenchmarkPipelineDataProcessing-10 1000 248.2 ns/op 232 B/op 7 allocs/op
BenchmarkPipelineVsDirectProcessing/Pipeline-10 1000 225.4 ns/op
BenchmarkPipelineVsDirectProcessing/Direct-10 1000 162.7 ns/op
BenchmarkPipelineMemoryUsage-10 1000 232.2 ns/op 510 B/op 9 allocs/op
BenchmarkPipelineBatchSizes/BatchSize1-10 500 740.5 ns/op 500.0 items_processed
BenchmarkPipelineBatchSizes/BatchSize10-10 500 251.5 ns/op 500.0 items_processed
BenchmarkPipelineBatchSizes/BatchSize50-10 500 146.5 ns/op 500.0 items_processed β
BenchmarkPipelineBatchSizes/BatchSize100-10 500 163.4 ns/op 500.0 items_processed
BenchmarkPipelineBatchSizes/BatchSize500-10 500 198.6 ns/op 500.0 items_processed
- Optimal Batch Size: Around 50, 5x performance improvement
- Pipeline Overhead: About 38%, in exchange for better architecture and maintainability
- Memory Efficiency: About 232-510 bytes memory usage per data item
- Processing Capacity: Can process millions of records per second
The deduplication pipeline adds the following performance characteristics on top of the standard pipeline:
- Memory Usage: Uses a map to store the current batch, so memory usage is slightly higher than the standard pipeline
- Processing Latency: Deduplication logic adds about 10-15% processing time
- Key Generation Overhead: A unique key must be generated for each data item
- Batch Efficiency: The batch size after deduplication may be smaller than the configured FlushSize
Performance Comparison:
- Standard Pipeline: ~225 ns/op
- Deduplication Pipeline: ~260 ns/op (about 15% overhead increase)
A: Configuration recommendations based on performance tests:
- High Throughput Scenario: FlushSize=50, BufferSize=100, FlushInterval=50ms
- Low Latency Scenario: FlushSize=10, BufferSize=50, FlushInterval=10ms
- Memory Constrained Scenario: FlushSize=20, BufferSize=40, FlushInterval=100ms
- CPU Intensive Processing: Use async mode, appropriately increase buffer size
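For example, the low-latency scenario above expressed with the chainable configuration methods shown earlier:
lowLatency := gopipeline.NewPipelineConfig().
	WithFlushSize(10).
	WithBufferSize(50).
	WithFlushInterval(10 * time.Millisecond)

pipeline := gopipeline.NewStandardPipeline(lowLatency, flushFunc)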
A: Important improvements in v2:
- Removed Add() Method: Replaced by the DataChan() API, following the "writer closes" principle
- Error Channel Improvement: ErrorChan(size int) uses lazy initialization; the first call decides the buffer size (later calls ignore it). If never called, a default size is used internally on the first send.
- Performance Optimization: Default configuration tuned based on benchmark tests
- Better Lifecycle Management: Users control when the data channel is closed
A:
- Violates Go Principles: Add() method violates Go's "writer closes" principle
- Better Control: DataChan() gives users complete control over data sending and channel closing
- More Conventional: This is the standard Go channel usage pattern
A: Migration steps:
// v1 approach
pipeline.Add(ctx, data)
// v2 approach
dataChan := pipeline.DataChan()
go func() {
defer close(dataChan)
for _, data := range dataList {
select {
case dataChan <- data:
case <-ctx.Done():
return
}
}
}()
A: The framework handles panics internally, but it's recommended to add recover in your batch processing function:
func(ctx context.Context, batchData []Task) error {
defer func() {
if r := recover(); r != nil {
log.Printf("Batch processing panic: %v", r)
}
}()
// Processing logic
return nil
}
Symptoms: Memory usage continuously growing.
Causes:
- Error channel not being consumed
- Data channel not properly closed
- Memory leaks in batch processing functions
Solutions:
// Ensure error channel is consumed
errorChan := pipeline.ErrorChan(10)
go func() {
for {
select {
case err, ok := <-errorChan:
if !ok {
return
}
// Handle error
case <-ctx.Done():
return
}
}
}()
// Ensure data channel is closed
dataChan := pipeline.DataChan()
defer close(dataChan)
Symptoms: Processing speed is slower than expected.
Troubleshooting Steps:
- Check if batch size is around 50
- Ensure BufferSize >= FlushSize * 2
- Use async mode
- Check batch processing function execution time
Optimization Recommendations:
// Use performance-optimized configuration
config := gopipeline.PipelineConfig{
BufferSize: 100, // >= FlushSize * 2
FlushSize: 50, // Optimal batch size
FlushInterval: time.Millisecond * 50, // Balance latency and throughput
}
Symptoms: Some data is not being processed.
Causes:
- Context cancelled too early
- Data channel closed too early
- Batch processing function returns error but not handled
Solutions:
// Use sufficient timeout
ctx, cancel := context.WithTimeout(context.Background(), time.Minute*5)
defer cancel()
// Ensure all data is sent before closing channel
dataChan := pipeline.DataChan()
go func() {
defer close(dataChan) // Close after all data is sent
for _, data := range allData {
select {
case dataChan <- data:
case <-ctx.Done():
return
}
}
}()
This project is licensed under the MIT License - see the LICENSE file for details.
Issues and Pull Requests to improve this project are welcome!