Error-Tolerant Streaming API Architecture Proposal #708

tkircsi · 2025-11-19T18:26:44Z

tkircsi
Nov 19, 2025
Maintainer

Motivation

The current AGNTCY Directory streaming APIs (Push, Pull, Lookup, Delete) follow a fail-fast pattern where any error encountered during stream processing immediately closes the entire stream. While this approach ensures strict error handling for single-operation use cases, it poses significant challenges for production batch operations and distributed system resilience.

Key challenges with the current approach:

Batch operations fail completely when a single item encounters an error, requiring clients to restart the entire batch
Limited observability - clients cannot distinguish between partial success and total failure
Inefficient retry logic - failed batches must be retried in their entirety, even if only one item failed
Poor resilience - transient errors in distributed systems affect entire batches rather than individual items
Inconsistent with PushReferrer - which already implements error-tolerant streaming with per-item error responses

This architectural limitation becomes especially problematic for:

Bulk imports and migrations (1000s of records)
ETL pipelines synchronizing between systems
Multi-tenant operations where one tenant's error shouldn't block others
Federation scenarios with intermittent connectivity
Production operations requiring detailed error tracking and metrics

By introducing error-tolerant streaming, we enable resilient batch operations while maintaining backward compatibility for simple use cases that benefit from fail-fast behavior.

🎯 Proposal: Error-Tolerant Streaming Architecture

We propose evolving all streaming APIs (Push, Pull, Lookup, Delete) to support per-item error handling rather than stream-level failures. This allows:

✅ Partial success in batch operations - process what succeeds, report what fails
✅ Detailed error information - error codes and messages per item
✅ Resilient production operations - transient failures don't kill entire batches
✅ Better observability - metrics on success rates and failure patterns
✅ Architectural consistency - all streaming methods behave uniformly

Current Behavior

// Fail-fast: any error closes the stream
for record := range records {
    if err := validate(record); err != nil {
        return err  // ❌ Entire stream terminates
    }
    if err := store.Push(record); err != nil {
        return err  // ❌ Entire stream terminates
    }
}

Proposed Behavior

// Error-tolerant: per-item errors are returned in the response
for record := range records {
    ref, err := processRecord(record)  // Validation or storage errors
    if err != nil {
        // Report the failure for this item and keep the stream open
        stream.Send(PushResponse{
            Success: false,
            ErrorCode: "VALIDATION_FAILED",  // e.g. derived from err
            ErrorMessage: err.Error(),       // e.g. "Invalid schema version"
        })
        continue
    }
    stream.Send(PushResponse{
        Success: true,
        RecordRef: ref,
    })
}

Note: Stream-level errors (network, authentication, etc.) would still close the stream as they represent transport/infrastructure failures.
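
To make the partial-success idea concrete, here is a minimal client-side sketch that splits per-item responses into stored references and failures. The `PushResponse` struct is a hand-written stand-in (the record reference is simplified to a string), and `Partition` is a hypothetical helper, not part of the Directory client. It also assumes responses can be matched to requests by arrival order, which the proposal does not state.

```go
package main

import "fmt"

// PushResponse is a simplified stand-in for the proposed message;
// the generated Go type may look different.
type PushResponse struct {
	Success      bool
	RecordRef    string // simplified: the proposal uses core.v1.RecordRef
	ErrorCode    string
	ErrorMessage string
}

// Partition (hypothetical helper) splits per-item responses into
// the references that were stored and the responses that failed.
func Partition(responses []PushResponse) (refs []string, failed []PushResponse) {
	for _, r := range responses {
		if r.Success {
			refs = append(refs, r.RecordRef)
		} else {
			failed = append(failed, r)
		}
	}
	return refs, failed
}

func main() {
	responses := []PushResponse{
		{Success: true, RecordRef: "ref-1"},
		{Success: false, ErrorCode: "VALIDATION_FAILED", ErrorMessage: "Invalid schema version"},
		{Success: true, RecordRef: "ref-3"},
	}
	refs, failed := Partition(responses)
	fmt.Printf("stored=%d failed=%d\n", len(refs), len(failed))
	for _, f := range failed {
		fmt.Printf("  %s: %s\n", f.ErrorCode, f.ErrorMessage)
	}
}
```

With fail-fast behavior the client would only have seen a single stream error; here it can report both outcomes.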

📋 Proposed Changes

1. Protocol Buffer Updates

All streaming methods will return structured responses with error information:

service StoreService {
  rpc Push(stream core.v1.Record) returns (stream PushResponse);
  rpc Pull(stream core.v1.RecordRef) returns (stream PullResponse);
  rpc Lookup(stream core.v1.RecordRef) returns (stream LookupResponse);
  rpc Delete(stream core.v1.RecordRef) returns (stream DeleteResponse);
  
  // Existing PushReferrer already uses this pattern ✅
  rpc PushReferrer(stream PushReferrerRequest) returns (stream PushReferrerResponse);
  rpc PullReferrer(stream PullReferrerRequest) returns (stream PullReferrerResponse);
}

message PushResponse {
  bool success = 1;
  optional core.v1.RecordRef record_ref = 2;  // Present on success
  optional string error_code = 3;              // Present on failure
  optional string error_message = 4;           // Present on failure
}

message PullResponse {
  bool success = 1;
  optional core.v1.Record record = 2;
  optional string error_code = 3;
  optional string error_message = 4;
}

message LookupResponse {
  bool success = 1;
  optional core.v1.RecordMeta record_meta = 2;
  optional string error_code = 3;
  optional string error_message = 4;
}

message DeleteResponse {
  bool success = 1;
  optional string error_code = 2;
  optional string error_message = 3;
}
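
The comments on `PushResponse` ("Present on success", "Present on failure") imply invariants that a client may want to check before trusting a response. The sketch below illustrates one way to do that. The struct mirrors the proto fields by hand with pointer fields for `optional`, and `CheckInvariants` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// PushResponse mirrors the proposed message by hand; optional proto
// fields are modeled as pointers, with record_ref simplified to a string.
type PushResponse struct {
	Success      bool
	RecordRef    *string
	ErrorCode    *string
	ErrorMessage *string
}

// CheckInvariants (hypothetical) verifies that the optional fields
// agree with the success flag.
func CheckInvariants(r PushResponse) error {
	switch {
	case r.Success && r.RecordRef == nil:
		return errors.New("success response without record_ref")
	case r.Success && (r.ErrorCode != nil || r.ErrorMessage != nil):
		return errors.New("success response carrying error fields")
	case !r.Success && r.ErrorCode == nil:
		return errors.New("failure response without error_code")
	case !r.Success && r.RecordRef != nil:
		return errors.New("failure response carrying record_ref")
	}
	return nil
}

func main() {
	ref := "ref-1"
	fmt.Println(CheckInvariants(PushResponse{Success: true, RecordRef: &ref}))
	fmt.Println(CheckInvariants(PushResponse{Success: false}))
}
```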

2. Server Controller Pattern

Server controllers will process each item individually and return errors in responses:

func (s storeCtrl) Push(stream storev1.StoreService_PushServer) error {
    for {
        record, err := stream.Recv()
        if errors.Is(err, io.EOF) {
            return nil  // Normal stream completion
        }
        if err != nil {
            return err  // Stream-level errors still fail fast
        }

        // Process with error tolerance
        response := s.processPushRecord(stream.Context(), record)
        
        if err := stream.Send(response); err != nil {
            return err  // Send failures are stream-level
        }
    }
}

func (s storeCtrl) processPushRecord(ctx context.Context, record *corev1.Record) *storev1.PushResponse {
    // Validation errors -> return in response
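    isValid, errs, err := record.Validate()
    if err != nil {
        // The validator itself failed: report its error, since errs may be empty
        return &storev1.PushResponse{
            Success: false,
            ErrorCode: pointer.String("INVALID_ARGUMENT"),
            ErrorMessage: pointer.String(fmt.Sprintf("validation failed: %v", err)),
        }
    }
    if !isValid {
        return &storev1.PushResponse{
            Success: false,
            ErrorCode: pointer.String("INVALID_ARGUMENT"),
            ErrorMessage: pointer.String(fmt.Sprintf("validation failed: %v", errs)),
        }
    }

    // Storage errors -> return in response
    ref, err := s.pushRecordToStore(ctx, record)
    if err != nil {
        // Note: codes.Code.String() yields names such as "InvalidArgument",
        // not "INVALID_ARGUMENT"; normalize if clients match on the code string
        st := status.Convert(err)
        return &storev1.PushResponse{
            Success: false,
            ErrorCode: pointer.String(st.Code().String()),
            ErrorMessage: pointer.String(st.Message()),
        }
    }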

    return &storev1.PushResponse{
        Success: true,
        RecordRef: ref,
    }
}
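
The controller loop can be exercised without gRPC. The following sketch replaces the generated stream with an in-memory `fakeStream` and uses simplified `Record` and `PushResponse` types, all of which are assumptions for illustration. It shows that an invalid item produces a failure response while the loop continues to the next record.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Simplified stand-ins for the generated types.
type Record struct{ Name string }

type PushResponse struct {
	Success   bool
	RecordRef string
	ErrorCode string
}

// fakeStream replays queued records, then returns io.EOF,
// and captures every response that is sent.
type fakeStream struct {
	in   []Record
	sent []PushResponse
}

func (f *fakeStream) Recv() (Record, error) {
	if len(f.in) == 0 {
		return Record{}, io.EOF
	}
	r := f.in[0]
	f.in = f.in[1:]
	return r, nil
}

func (f *fakeStream) Send(r PushResponse) error {
	f.sent = append(f.sent, r)
	return nil
}

// process turns item-level problems into a response instead of an error.
func process(r Record) PushResponse {
	if r.Name == "" {
		return PushResponse{Success: false, ErrorCode: "INVALID_ARGUMENT"}
	}
	return PushResponse{Success: true, RecordRef: "ref-" + r.Name}
}

// Push follows the controller pattern: only Recv/Send errors end the stream.
func Push(s *fakeStream) error {
	for {
		rec, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return nil // normal completion
		}
		if err != nil {
			return err // stream-level error
		}
		if err := s.Send(process(rec)); err != nil {
			return err // send failures are stream-level
		}
	}
}

func main() {
	s := &fakeStream{in: []Record{{"a"}, {""}, {"c"}}}
	err := Push(s)
	fmt.Println("stream error:", err, "responses:", len(s.sent))
}
```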

3. Client API Options

We propose providing dual client methods to support both use cases:

Option A: Explicit Naming (Recommended)

// Fail-fast versions (backward-compatible behavior)
func (c *Client) Push(ctx, record) (*RecordRef, error)
func (c *Client) PushBatch(ctx, records) ([]*RecordRef, error)
func (c *Client) PushStream(ctx, recordsCh) (StreamResult[RecordRef], error)

// Error-tolerant versions (new capability)
func (c *Client) PushTolerant(ctx, record) (*PushResponse, error)
func (c *Client) PushBatchTolerant(ctx, records) ([]*PushResponse, error)
func (c *Client) PushStreamTolerant(ctx, recordsCh) (StreamResult[PushResponse], error)

Option B: Concise Naming

func (c *Client) PushStream(ctx, recordsCh) (StreamResult[RecordRef], error)        // fail-fast
func (c *Client) PushStreamErr(ctx, recordsCh) (StreamResult[PushResponse], error)  // tolerant

Option C: Default Tolerant (Breaking Change)

func (c *Client) PushStream(ctx, recordsCh) (StreamResult[PushResponse], error)       // tolerant (default)
func (c *Client) PushStreamStrict(ctx, recordsCh) (StreamResult[RecordRef], error)    // fail-fast (explicit)
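
Whichever naming option is chosen, a fail-fast client method could be layered on top of tolerant results. The sketch below assumes simplified types; `FailFast` and `ItemError` are hypothetical names. Note that it only emulates the fail-fast return shape: against a tolerant server, items after the first failure would already have been processed.

```go
package main

import "fmt"

// PushResponse is a simplified stand-in for the proposed message.
type PushResponse struct {
	Success      bool
	RecordRef    string
	ErrorCode    string
	ErrorMessage string
}

// ItemError (hypothetical) identifies the first failed item in a batch.
type ItemError struct {
	Index   int
	Code    string
	Message string
}

func (e *ItemError) Error() string {
	return fmt.Sprintf("item %d: %s: %s", e.Index, e.Code, e.Message)
}

// FailFast (hypothetical) converts tolerant results into the fail-fast
// shape: the refs collected before the first failure, plus that failure.
func FailFast(responses []PushResponse) ([]string, error) {
	refs := make([]string, 0, len(responses))
	for i, r := range responses {
		if !r.Success {
			return refs, &ItemError{Index: i, Code: r.ErrorCode, Message: r.ErrorMessage}
		}
		refs = append(refs, r.RecordRef)
	}
	return refs, nil
}

func main() {
	refs, err := FailFast([]PushResponse{
		{Success: true, RecordRef: "ref-1"},
		{Success: false, ErrorCode: "NOT_FOUND", ErrorMessage: "no such record"},
		{Success: true, RecordRef: "ref-3"},
	})
	fmt.Println(refs, err)
}
```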

4. Streaming Package Enhancement

Create new processors for error-tolerant streams:

// client/streaming/tolerant_stream.go

// ProcessTolerantBidiStream handles per-message errors
func ProcessTolerantBidiStream[InT any, RespT ResponseWithError](
    ctx context.Context,
    stream BidiStream[InT, RespT],
    inputCh <-chan *InT,
) (StreamResult[RespT], error)

// ResponseWithError interface for responses with error handling
type ResponseWithError interface {
    IsSuccess() bool
    GetError() (code string, message string)
}
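
As an illustration of why a shared `ResponseWithError` interface is useful, the sketch below implements it on a simplified `PushResponse` and writes one generic summary function that would work for any response type satisfying the interface. `Summary` and `Summarize` are hypothetical names, not part of the proposed package.

```go
package main

import "fmt"

// ResponseWithError is the interface from the proposal.
type ResponseWithError interface {
	IsSuccess() bool
	GetError() (code string, message string)
}

// PushResponse is a simplified stand-in that satisfies the interface.
type PushResponse struct {
	Success      bool
	ErrorCode    string
	ErrorMessage string
}

func (r PushResponse) IsSuccess() bool            { return r.Success }
func (r PushResponse) GetError() (string, string) { return r.ErrorCode, r.ErrorMessage }

// Summary (hypothetical) aggregates the outcome of one stream.
type Summary struct {
	Succeeded int
	Failed    int
	ByCode    map[string]int
}

// Summarize (hypothetical) works for Push, Pull, Lookup, or Delete
// responses alike, as long as they implement ResponseWithError.
func Summarize[RespT ResponseWithError](responses []RespT) Summary {
	s := Summary{ByCode: map[string]int{}}
	for _, r := range responses {
		if r.IsSuccess() {
			s.Succeeded++
			continue
		}
		code, _ := r.GetError()
		s.Failed++
		s.ByCode[code]++
	}
	return s
}

func main() {
	s := Summarize([]PushResponse{
		{Success: true},
		{Success: false, ErrorCode: "NOT_FOUND", ErrorMessage: "missing"},
		{Success: true},
	})
	fmt.Printf("ok=%d failed=%d byCode=%v\n", s.Succeeded, s.Failed, s.ByCode)
}
```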

🔄 Implementation Options Comparison

Option 1: New RPC Methods (PushV2, PullV2, etc.)

Pros:

✅ No breaking changes
✅ Gradual migration path
✅ Both behaviors available simultaneously

Cons:

⚠️ API duplication
⚠️ Increased maintenance burden
⚠️ Eventually need deprecation strategy

Option 2: API Versioning (v2 API)

Pros:

✅ Clean separation of old/new behavior
✅ Clear migration path
✅ Industry standard approach

Cons:

⚠️ Requires v2 package/module
⚠️ Clients must explicitly migrate
⚠️ Longer transition period

Option 3: In-Place Breaking Change

Pros:

✅ Clean architecture
✅ No duplication
✅ Forces ecosystem alignment

Cons:

❌ Breaks existing clients
❌ Requires coordinated updates
❌ Risk for production deployments

Option 4: Request-Level Flags (Recommended for v1)

Pros:

✅ Single API surface
✅ Per-request control
✅ Easy to add features (dry-run, etc.)

Cons:

⚠️ Still requires proto changes
⚠️ Slightly more complex implementation

🎯 Recommended Approach

Based on the AGNTCY Directory's current state and the testbed environment:

Phase 1: Testbed Implementation (Current v0.5.x)

Implement error-tolerant streaming as the default behavior
Provide both fail-fast and tolerant client methods
Gather feedback from testbed participants
Refine error codes and response structures

Rationale: Since we're in testbed/staging (no SLA guarantees), this is the ideal time to make architectural improvements before 1.0 production readiness.

Phase 2: Stabilization (v0.6.x - v0.9.x)

Iterate based on testbed feedback
Document migration patterns
Add comprehensive metrics and observability
Test federation scenarios with error handling

Phase 3: v1.0 Release

Stabilized error-tolerant streaming API
Dual client methods (fail-fast + tolerant)
Production-ready error handling patterns
Comprehensive documentation and examples

📊 Design Considerations

1. Error Classification

Item-level errors (tolerant):

✅ Validation failures (malformed records, schema violations)
✅ Storage errors (disk full, quota exceeded)
✅ Not found errors (Pull, Lookup, Delete)
✅ Already exists errors (Push with duplicate CID)
✅ Permission errors (per-record authorization)

Stream-level errors (fail-fast):

❌ Network failures (connection lost)
❌ Authentication failures (invalid credentials)
❌ Protocol errors (malformed gRPC messages)
❌ Send/Recv failures (transport issues)
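
One way to encode this classification is a lookup keyed by canonical gRPC status names. The mapping below is an assumption drawn from the two lists above (for example, quota problems mapped to RESOURCE_EXHAUSTED), and treating unknown codes as stream-level is a conservative choice of this sketch, not something the proposal specifies.

```go
package main

import "fmt"

// Scope says whether an error is reported per item or ends the stream.
type Scope int

const (
	ItemLevel Scope = iota
	StreamLevel
)

// itemLevelCodes lists canonical gRPC status names this sketch treats as
// per-item errors. The selection is an assumption for illustration.
var itemLevelCodes = map[string]bool{
	"INVALID_ARGUMENT":   true, // validation failures
	"NOT_FOUND":          true, // Pull, Lookup, Delete on missing records
	"ALREADY_EXISTS":     true, // Push with duplicate CID
	"PERMISSION_DENIED":  true, // per-record authorization
	"RESOURCE_EXHAUSTED": true, // quota exceeded
}

// Classify returns ItemLevel for known per-item codes and StreamLevel for
// everything else (e.g. UNAVAILABLE, UNAUTHENTICATED, or unknown codes).
func Classify(code string) Scope {
	if itemLevelCodes[code] {
		return ItemLevel
	}
	return StreamLevel
}

func main() {
	for _, c := range []string{"NOT_FOUND", "UNAVAILABLE", "UNAUTHENTICATED"} {
		fmt.Printf("%s item-level=%v\n", c, Classify(c) == ItemLevel)
	}
}
```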

2. Transaction Semantics

Important: This proposal makes operations non-atomic at the batch level:

Each item is processed independently
Partial batch success is valid
No automatic rollback of successful items
Clients must implement retry/reconciliation logic

Alternative: For atomic batch operations, consider adding:

message PushBatchRequest {
  repeated core.v1.Record records = 1;
  bool atomic = 2;  // All-or-nothing semantics
}
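
To illustrate how an `atomic` flag could change behavior, the sketch below validates every record first and writes nothing if any record is invalid. The in-memory `Store`, `Record`, and `validate` are assumptions; this covers validation failures only, and a real implementation would also need rollback or transactions for storage errors during the commit phase.

```go
package main

import "fmt"

// Record and Store are simplified in-memory stand-ins.
type Record struct{ CID, Data string }

type Store struct{ records map[string]string }

func validate(r Record) error {
	if r.Data == "" {
		return fmt.Errorf("record %s: empty data", r.CID)
	}
	return nil
}

// PushBatch returns one error slot per record (nil means valid).
// With atomic set, any invalid record leaves the store untouched;
// otherwise the valid records are stored and the rest are reported.
func (s *Store) PushBatch(records []Record, atomic bool) []error {
	errs := make([]error, len(records))
	failed := false
	for i, r := range records {
		if errs[i] = validate(r); errs[i] != nil {
			failed = true
		}
	}
	if atomic && failed {
		return errs // all-or-nothing: nothing is written
	}
	for i, r := range records {
		if errs[i] == nil {
			s.records[r.CID] = r.Data
		}
	}
	return errs
}

func main() {
	batch := []Record{{"cid-1", "x"}, {"cid-2", ""}, {"cid-3", "z"}}

	atomicStore := &Store{records: map[string]string{}}
	atomicStore.PushBatch(batch, true)
	fmt.Println("atomic stored:", len(atomicStore.records))

	tolerantStore := &Store{records: map[string]string{}}
	tolerantStore.PushBatch(batch, false)
	fmt.Println("tolerant stored:", len(tolerantStore.records))
}
```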

3. Backward Compatibility Strategy

Proposed approach:

Add new response types with success/error fields
Keep existing method names (breaking change acceptable in testbed)
Provide both fail-fast and tolerant client wrappers
Document migration guide with examples
Add deprecation notices if keeping old behavior

4. Federation Implications

Error-tolerant streaming is especially important for federation:

Cross-registry sync - don't fail entire sync on one bad record
Multi-tenant operations - isolate errors between tenants
Intermittent connectivity - partial success when network is unstable
Different registry capabilities - some may reject records others accept

5. Observability & Metrics

New metrics to add:

// Per-operation metrics
- store_push_success_total
- store_push_failure_total
- store_push_validation_failure_total
- store_push_storage_failure_total

// Batch operation metrics
- store_push_batch_items_total
- store_push_batch_success_rate
- store_push_batch_partial_failure_total

// Error breakdown
- store_errors_by_code{code="INVALID_ARGUMENT"}
- store_errors_by_code{code="ALREADY_EXISTS"}
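
The batch metrics above could be derived from per-batch success and failure counts. The sketch below keeps plain in-memory counters to show the arithmetic; `BatchMetrics` and its methods are hypothetical, and a real deployment would presumably export these through its metrics library instead.

```go
package main

import (
	"fmt"
	"sync"
)

// BatchMetrics (hypothetical) holds counters corresponding to
// store_push_batch_items_total, store_push_success_total,
// store_push_failure_total and store_push_batch_partial_failure_total.
type BatchMetrics struct {
	mu                  sync.Mutex
	ItemsTotal          int
	SuccessTotal        int
	FailureTotal        int
	PartialFailureTotal int
}

// ObserveBatch records the outcome of one finished batch.
func (m *BatchMetrics) ObserveBatch(succeeded, failed int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.ItemsTotal += succeeded + failed
	m.SuccessTotal += succeeded
	m.FailureTotal += failed
	if succeeded > 0 && failed > 0 {
		m.PartialFailureTotal++ // some items stored, some rejected
	}
}

// SuccessRate is the fraction of items that succeeded (0 when empty).
func (m *BatchMetrics) SuccessRate() float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.ItemsTotal == 0 {
		return 0
	}
	return float64(m.SuccessTotal) / float64(m.ItemsTotal)
}

func main() {
	m := &BatchMetrics{}
	m.ObserveBatch(3, 1) // partial failure
	m.ObserveBatch(4, 0) // full success
	fmt.Printf("items=%d partial=%d rate=%.3f\n",
		m.ItemsTotal, m.PartialFailureTotal, m.SuccessRate())
}
```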

6. Enhanced Error Codes

Structured error codes for programmatic handling:

enum ErrorCode {
  ERROR_CODE_UNSPECIFIED = 0;
  ERROR_CODE_VALIDATION_FAILED = 1;
  ERROR_CODE_INVALID_ARGUMENT = 2;
  ERROR_CODE_NOT_FOUND = 3;
  ERROR_CODE_ALREADY_EXISTS = 4;
  ERROR_CODE_STORAGE_FAILED = 5;
  ERROR_CODE_QUOTA_EXCEEDED = 6;
  ERROR_CODE_PERMISSION_DENIED = 7;
  ERROR_CODE_INTERNAL_ERROR = 8;
}
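
For programmatic handling, a server might translate gRPC status names into this enum. The sketch below mirrors the enum by hand and shows one possible mapping; `FromStatusName` is hypothetical, and the choice to map RESOURCE_EXHAUSTED to the quota code and everything unrecognized to the internal-error code is an assumption of this sketch.

```go
package main

import "fmt"

// ErrorCode mirrors the proposed proto enum by hand.
type ErrorCode int32

const (
	ErrorCodeUnspecified ErrorCode = iota
	ErrorCodeValidationFailed
	ErrorCodeInvalidArgument
	ErrorCodeNotFound
	ErrorCodeAlreadyExists
	ErrorCodeStorageFailed
	ErrorCodeQuotaExceeded
	ErrorCodePermissionDenied
	ErrorCodeInternalError
)

var errorCodeNames = [...]string{
	"ERROR_CODE_UNSPECIFIED", "ERROR_CODE_VALIDATION_FAILED",
	"ERROR_CODE_INVALID_ARGUMENT", "ERROR_CODE_NOT_FOUND",
	"ERROR_CODE_ALREADY_EXISTS", "ERROR_CODE_STORAGE_FAILED",
	"ERROR_CODE_QUOTA_EXCEEDED", "ERROR_CODE_PERMISSION_DENIED",
	"ERROR_CODE_INTERNAL_ERROR",
}

// String returns the proto-style name, falling back to UNSPECIFIED.
func (c ErrorCode) String() string {
	if c < 0 || int(c) >= len(errorCodeNames) {
		return errorCodeNames[0]
	}
	return errorCodeNames[c]
}

// FromStatusName (hypothetical) maps a canonical gRPC status name to
// the proposed enum; unrecognized names become INTERNAL_ERROR.
func FromStatusName(name string) ErrorCode {
	switch name {
	case "INVALID_ARGUMENT":
		return ErrorCodeInvalidArgument
	case "NOT_FOUND":
		return ErrorCodeNotFound
	case "ALREADY_EXISTS":
		return ErrorCodeAlreadyExists
	case "RESOURCE_EXHAUSTED":
		return ErrorCodeQuotaExceeded
	case "PERMISSION_DENIED":
		return ErrorCodePermissionDenied
	default:
		return ErrorCodeInternalError
	}
}

func main() {
	for _, n := range []string{"NOT_FOUND", "RESOURCE_EXHAUSTED", "DATA_LOSS"} {
		fmt.Println(n, "->", FromStatusName(n))
	}
}
```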

🚀 Expected Benefits

For Directory Users

Resilient batch operations - bulk imports don't fail on single bad record
Better error visibility - know exactly which records failed and why
Efficient retries - retry only failed items, not entire batch
Production-ready - handle transient failures gracefully

For Federation Partners

Robust synchronization - cross-registry sync handles partial failures
Multi-tenant safety - tenant errors don't affect others
Heterogeneous registries - different validation rules across registries

For AGNTCY Ecosystem

Consistent API patterns - all streaming methods behave uniformly
Aligned with PushReferrer - leverage existing pattern
Better observability - metrics on success rates and error patterns
Familiar pattern - per-item results in streamed responses are a common approach in gRPC batch APIs

📚 Implementation Checklist

💬 Call for Feedback

We invite the AGNTCY community, testbed participants, and TSC members to provide feedback on:

Approach preference - which implementation option works best for your use case?
Client API naming - which naming convention is most intuitive?
Error semantics - which errors should be tolerant vs fail-fast?
Transaction requirements - do you need atomic batch operations?
Federation concerns - how does this affect cross-registry scenarios?
Migration timeline - how quickly can you adapt to this change?

Questions to Consider

Should validation errors be tolerant or fail-fast?
Do you need atomic batch semantics (all-or-nothing)?
Would request-level flags (fail_fast=true) be useful?
Should we provide batch summary metadata?
Are there other operations that need error tolerance?

🔗 Resources

Ready to make batch operations resilient?
Share your thoughts, vote on the approach, and help shape the future of AGNTCY Directory streaming APIs! 🚀

This proposal is part of the AGNTCY Directory Testbed initiative. Join the testbed to experiment with these features before they reach production. Learn more: AGNTCY Agent Directory Testbed CFP

Replies: 0 comments