Motivation
The current AGNTCY Directory streaming APIs (Push, Pull, Lookup, Delete) follow a fail-fast pattern where any error encountered during stream processing immediately closes the entire stream. While this approach ensures strict error handling for single-operation use cases, it poses significant challenges for production batch operations and distributed system resilience.
Key challenges with the current approach:
Batch operations fail completely when a single item encounters an error, requiring clients to restart the entire batch
Limited observability - clients cannot distinguish between partial success and total failure
Inefficient retry logic - failed batches must be retried in their entirety, even if only one item failed
Poor resilience - transient errors in distributed systems affect entire batches rather than individual items
Inconsistent with PushReferrer, which already implements error-tolerant streaming with per-item error responses
This architectural limitation becomes especially problematic for:
Bulk imports and migrations (thousands of records)
ETL pipelines synchronizing between systems
Multi-tenant operations where one tenant's error shouldn't block others
Federation scenarios with intermittent connectivity
Production operations requiring detailed error tracking and metrics
By introducing error-tolerant streaming, we enable resilient batch operations while maintaining backward compatibility for simple use cases that benefit from fail-fast behavior.
🎯 Proposal: Error-Tolerant Streaming Architecture
We propose evolving all streaming APIs (Push, Pull, Lookup, Delete) to support per-item error handling rather than stream-level failures. This allows:
✅ Partial success in batch operations - process what succeeds, report what fails
✅ Detailed error information - error codes and messages per item
✅ Resilient production operations - transient failures don't kill entire batches
✅ Better observability - metrics on success rates and failure patterns
✅ Architectural consistency - all streaming methods behave uniformly
Option 4: Request-Level Flags (Recommended for v1)
Pros:
✅ Single API surface
✅ Per-request control
✅ Easy to add features (dry-run, etc.)
Cons:
⚠️ Still requires proto changes
⚠️ Slightly more complex implementation
🎯 Recommended Approach
Based on the AGNTCY Directory's current state and the testbed environment:
Phase 1: Testbed Implementation (Current v0.5.x)
Implement error-tolerant streaming as the default behavior
Provide both fail-fast and tolerant client methods
Gather feedback from testbed participants
Refine error codes and response structures
Rationale: Since we're in testbed/staging (no SLA guarantees), this is the ideal time to make architectural improvements before 1.0 production readiness.
Current Behavior
Proposed Behavior
Note: Stream-level errors (network, authentication, etc.) would still close the stream as they represent transport/infrastructure failures.
📋 Proposed Changes
1. Protocol Buffer Updates
All streaming methods will return structured responses with error information:
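As a rough Go-side sketch of what such a structured response could carry, the types below pair each item with a status and an optional error. The names `PushResponse`, `ItemStatus`, and `ItemError`, and the `INVALID_RECORD` code, are hypothetical and chosen only for illustration; they are not the actual AGNTCY Directory message definitions.

```go
package main

import "fmt"

// ItemStatus is a hypothetical per-item outcome, not the real Directory enum.
type ItemStatus int

const (
	StatusOK ItemStatus = iota
	StatusFailed
)

// ItemError carries a machine-readable code and a human-readable message.
type ItemError struct {
	Code    string
	Message string
}

// PushResponse is an illustrative stand-in for one streamed response message.
type PushResponse struct {
	Ref    string     // identifier of the item this response refers to
	Status ItemStatus // outcome for this item only
	Err    *ItemError // nil when Status is StatusOK
}

// Summarize counts successes and failures in a batch of responses.
func Summarize(responses []PushResponse) (ok, failed int) {
	for _, r := range responses {
		if r.Status == StatusOK {
			ok++
		} else {
			failed++
		}
	}
	return ok, failed
}

func main() {
	responses := []PushResponse{
		{Ref: "record-1", Status: StatusOK},
		{Ref: "record-2", Status: StatusFailed, Err: &ItemError{Code: "INVALID_RECORD", Message: "missing name"}},
		{Ref: "record-3", Status: StatusOK},
	}
	ok, failed := Summarize(responses)
	fmt.Printf("ok=%d failed=%d\n", ok, failed)
}
```

Because every response names the item it belongs to, a client can tell partial success from total failure without parsing a stream-level error.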
2. Server Controller Pattern
Server controllers will process each item individually and return errors in responses:
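A minimal sketch of that controller loop follows, assuming a simplified `Stream` interface with `Recv` and `Send` and an in-memory `sliceStream` for the demo. Both are hypothetical stand-ins for the generated gRPC stream types, and `store` is a placeholder for the real per-item work.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Result is the per-item response; Err is empty on success.
type Result struct {
	ID  string
	Err string
}

// Stream is a hypothetical, minimal view of a bidirectional stream.
type Stream interface {
	Recv() (string, error)
	Send(Result) error
}

// sliceStream is an in-memory Stream used only for this demo.
type sliceStream struct {
	in  []string
	out []Result
}

func (s *sliceStream) Recv() (string, error) {
	if len(s.in) == 0 {
		return "", io.EOF
	}
	id := s.in[0]
	s.in = s.in[1:]
	return id, nil
}

func (s *sliceStream) Send(r Result) error {
	s.out = append(s.out, r)
	return nil
}

// store stands in for the real per-item work; it rejects empty IDs.
func store(id string) error {
	if id == "" {
		return errors.New("empty record ID")
	}
	return nil
}

// HandlePush reports item failures inside the response and returns an
// error only for stream-level failures from Recv or Send.
func HandlePush(s Stream) error {
	for {
		id, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return nil // client finished sending
		}
		if err != nil {
			return err // transport failure: close the stream
		}
		res := Result{ID: id}
		if err := store(id); err != nil {
			res.Err = err.Error() // item failure: report and continue
		}
		if err := s.Send(res); err != nil {
			return err
		}
	}
}

func main() {
	s := &sliceStream{in: []string{"a", "", "c"}}
	if err := HandlePush(s); err != nil {
		fmt.Println("stream error:", err)
		return
	}
	for _, r := range s.out {
		fmt.Printf("id=%q err=%q\n", r.ID, r.Err)
	}
}
```

The key design point is the split between the two error paths: a failed item becomes data in the response, while a failed `Recv` or `Send` still ends the stream.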
3. Client API Options
We propose providing dual client methods to support both use cases:
Option A: Explicit Naming (Recommended)
Option B: Concise Naming
Option C: Default Tolerant (Breaking Change)
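To make the dual-method idea concrete, here is a small sketch of a tolerant call next to a fail-fast call over the same per-item operation. The names `PushTolerant` and `PushFailFast` are placeholders rather than the names proposed in any of the options, and `pushOne` simulates sending a single item.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is the outcome for one item; Err is nil on success.
type Result struct {
	ID  string
	Err error
}

// pushOne stands in for sending one item; it rejects empty IDs.
func pushOne(id string) error {
	if id == "" {
		return errors.New("empty record ID")
	}
	return nil
}

// PushTolerant attempts every item and returns one Result per input.
func PushTolerant(ids []string) []Result {
	results := make([]Result, 0, len(ids))
	for _, id := range ids {
		results = append(results, Result{ID: id, Err: pushOne(id)})
	}
	return results
}

// PushFailFast stops at the first failure and returns the items that
// succeeded before it, together with the wrapped error.
func PushFailFast(ids []string) ([]Result, error) {
	var done []Result
	for _, id := range ids {
		if err := pushOne(id); err != nil {
			return done, fmt.Errorf("push %q: %w", id, err)
		}
		done = append(done, Result{ID: id})
	}
	return done, nil
}

func main() {
	ids := []string{"a", "", "c"}

	for _, r := range PushTolerant(ids) {
		fmt.Printf("tolerant: id=%q err=%v\n", r.ID, r.Err)
	}

	done, err := PushFailFast(ids)
	fmt.Printf("fail-fast: %d pushed before error: %v\n", len(done), err)
}
```

Whichever naming option is chosen, the behavioral contract is the same: the tolerant method never drops an item silently, and the fail-fast method keeps today's semantics for callers that want them.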
4. Streaming Package Enhancement
Create new processors for error-tolerant streams:
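One possible shape for such a processor is a generic, channel-based helper that emits one outcome per input and never stops on an item error. `ProcessTolerant` and `Outcome` are hypothetical names, and the existing streaming package may be structured quite differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome pairs an input with its computed value or its error.
type Outcome[In, Out any] struct {
	Input In
	Value Out
	Err   error
}

// ProcessTolerant applies fn to every item read from in and emits one
// Outcome per item. The returned channel is closed once in is drained.
func ProcessTolerant[In, Out any](in <-chan In, fn func(In) (Out, error)) <-chan Outcome[In, Out] {
	out := make(chan Outcome[In, Out])
	go func() {
		defer close(out)
		for item := range in {
			v, err := fn(item)
			out <- Outcome[In, Out]{Input: item, Value: v, Err: err}
		}
	}()
	return out
}

// reciprocal is a toy per-item operation that fails for zero.
func reciprocal(n int) (float64, error) {
	if n == 0 {
		return 0, errors.New("cannot invert zero")
	}
	return 1 / float64(n), nil
}

func main() {
	in := make(chan int)
	go func() {
		defer close(in)
		for _, n := range []int{4, 0, 2} {
			in <- n
		}
	}()

	for o := range ProcessTolerant[int, float64](in, reciprocal) {
		fmt.Printf("input=%d value=%v err=%v\n", o.Input, o.Value, o.Err)
	}
}
```

Keeping the processor generic means the same loop could back Push, Pull, Lookup, and Delete, with only the per-item function changing.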
🔄 Implementation Options Comparison
Option 1: New RPC Methods (PushV2, PullV2, etc.)
Pros:
Cons:
Option 2: API Versioning (v2 API)
Pros:
Cons:
Option 3: In-Place Breaking Change
Pros:
Cons:
Option 4: Request-Level Flags (Recommended for v1)
Pros:
Cons:
🎯 Recommended Approach
Based on the AGNTCY Directory's current state and the testbed environment:
Phase 1: Testbed Implementation (Current v0.5.x)
Rationale: Since we're in testbed/staging (no SLA guarantees), this is the ideal time to make architectural improvements before 1.0 production readiness.
Phase 2: Stabilization (v0.6.x - v0.9.x)
Phase 3: v1.0 Release
📊 Design Considerations
1. Error Classification
Item-level errors (tolerant):
Stream-level errors (fail-fast):
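The distinction can be sketched as a small classifier. The proposal names network and authentication failures as stream-level; treating validation failures as item-level is an assumption made here for illustration, as are the sentinel error names.

```go
package main

import (
	"errors"
	"fmt"
)

// Scope says whether an error affects one item or the whole stream.
type Scope int

const (
	ItemLevel Scope = iota
	StreamLevel
)

// Hypothetical sentinel errors used only for this sketch.
var (
	ErrNetwork         = errors.New("network failure")
	ErrUnauthenticated = errors.New("unauthenticated")
	ErrValidation      = errors.New("validation failed") // assumed item-level
)

// Classify treats transport and authentication failures as stream-level
// and everything else as item-level.
func Classify(err error) Scope {
	if errors.Is(err, ErrNetwork) || errors.Is(err, ErrUnauthenticated) {
		return StreamLevel
	}
	return ItemLevel
}

func main() {
	errs := []error{
		fmt.Errorf("record 7: %w", ErrValidation),
		fmt.Errorf("dial peer: %w", ErrNetwork),
	}
	for _, err := range errs {
		if Classify(err) == StreamLevel {
			fmt.Println("close stream:", err)
		} else {
			fmt.Println("report and continue:", err)
		}
	}
}
```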
2. Transaction Semantics
Important: This proposal makes operations non-atomic at the batch level:
Alternative: For atomic batch operations, consider adding:
3. Backward Compatibility Strategy
Proposed approach:
4. Federation Implications
Error-tolerant streaming is especially important for federation:
5. Observability & Metrics
New metrics to add:
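As an illustration of the kind of tally per-item responses make possible, the sketch below counts successes and failures by error code and derives a success rate. `BatchMetrics` is a hypothetical in-memory type; a real deployment would export these values through whatever metrics library the Directory already uses.

```go
package main

import (
	"fmt"
	"sync"
)

// BatchMetrics is a hypothetical in-memory tally of per-item outcomes.
type BatchMetrics struct {
	mu           sync.Mutex
	succeeded    int
	failedByCode map[string]int
}

// NewBatchMetrics returns an empty tally ready for use.
func NewBatchMetrics() *BatchMetrics {
	return &BatchMetrics{failedByCode: make(map[string]int)}
}

// Record tallies one item; an empty code means the item succeeded.
func (m *BatchMetrics) Record(code string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if code == "" {
		m.succeeded++
		return
	}
	m.failedByCode[code]++
}

// SuccessRate returns succeeded/total, or 0 when nothing was recorded.
func (m *BatchMetrics) SuccessRate() float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	total := m.succeeded
	for _, n := range m.failedByCode {
		total += n
	}
	if total == 0 {
		return 0
	}
	return float64(m.succeeded) / float64(total)
}

func main() {
	m := NewBatchMetrics()
	for _, code := range []string{"", "", "UNAVAILABLE", ""} {
		m.Record(code)
	}
	fmt.Printf("success rate: %.2f\n", m.SuccessRate())
}
```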
6. Enhanced Error Codes
Structured error codes for programmatic handling:
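One thing structured codes enable is retrying only the items that failed for transient reasons, rather than the whole batch. The codes and the retryable set below are hypothetical examples and not the proposal's actual code list.

```go
package main

import "fmt"

// ErrorCode is a machine-readable failure reason; empty means success.
type ErrorCode string

// Hypothetical codes used only for this sketch.
const (
	CodeInvalidArgument ErrorCode = "INVALID_ARGUMENT"
	CodeUnavailable     ErrorCode = "UNAVAILABLE"
)

// Retryable reports whether a failure with this code is worth retrying.
func (c ErrorCode) Retryable() bool {
	return c == CodeUnavailable
}

// ItemResult is the outcome for one item in a batch.
type ItemResult struct {
	ID   string
	Code ErrorCode
}

// RetryCandidates returns the IDs of items that failed with a retryable code.
func RetryCandidates(results []ItemResult) []string {
	var ids []string
	for _, r := range results {
		if r.Code != "" && r.Code.Retryable() {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

func main() {
	results := []ItemResult{
		{ID: "a"},
		{ID: "b", Code: CodeUnavailable},
		{ID: "c", Code: CodeInvalidArgument},
	}
	fmt.Println("retry:", RetryCandidates(results))
}
```

In this example only item `b` would be resubmitted; `c` failed permanently and `a` already succeeded.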
🚀 Expected Benefits
For Directory Users
For Federation Partners
For AGNTCY Ecosystem
📚 Implementation Checklist
💬 Call for Feedback
We invite the AGNTCY community, testbed participants, and TSC members to provide feedback on:
Questions to Consider
🔗 Resources
Ready to make batch operations resilient?
Share your thoughts, vote on the approach, and help shape the future of AGNTCY Directory streaming APIs! 🚀
This proposal is part of the AGNTCY Directory Testbed initiative. Join the testbed to experiment with these features before they reach production. Learn more: AGNTCY Agent Directory Testbed CFP