
Cnt scalability #316

Open
yl-nuwan wants to merge 45 commits into develop from CNT-scalability

Conversation

@yl-nuwan

Contributor

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update
  • CI/CD update
  • Other (please describe):

Related Issues

Fixes #
Relates to #

Changes Made

Testing

  • Unit tests pass locally
  • Integration tests pass locally
  • Manual testing completed
  • New tests added for changes

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

yl-nuwan added 30 commits June 5, 2026 10:01
…moDB support

- Separate agent runner execution role (image pull, logs) from task role (SQS, DynamoDB access)
- Add CloudWatch Logs policy to agent runner task role for container log writes
- Add DynamoDB memory table access policy for agent runner task role
- Update ECS task definition to use execution role for image operations
- Add comprehensive docstring updates to akagentrunner.py for boto3 and Lambda message format handling
- Update sqs_handler.py to handle both Lambda camelCase and boto3 PascalCase attribute keys
- Add queue-mode-guide.md documentation for queue mode deployment and configuration
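
The key-casing difference mentioned for sqs_handler.py can be illustrated with a small normalizer. This is a minimal sketch, not the PR's implementation: `normalize_sqs_message` and its output keys are hypothetical names. It relies only on the documented shapes of the two sources: Lambda SQS event records use camelCase (`messageId`, `body`, `messageAttributes`, `stringValue`), while boto3 `receive_message` results use PascalCase (`MessageId`, `Body`, `MessageAttributes`, `StringValue`).

```python
from typing import Any, Dict


def _pick(source: Dict[str, Any], camel: str, pascal: str, default: Any = None) -> Any:
    """Return the value stored under the camelCase key, else the PascalCase key."""
    if camel in source:
        return source[camel]
    return source.get(pascal, default)


def normalize_sqs_message(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Lambda event record or a boto3 message to one common shape.

    Hypothetical helper: the output key names are assumptions for illustration.
    """
    raw_attrs = _pick(raw, "messageAttributes", "MessageAttributes", {}) or {}
    attributes = {
        name: _pick(attr, "stringValue", "StringValue")
        for name, attr in raw_attrs.items()
    }
    return {
        "message_id": _pick(raw, "messageId", "MessageId"),
        "receipt_handle": _pick(raw, "receiptHandle", "ReceiptHandle"),
        "body": _pick(raw, "body", "Body"),
        "attributes": attributes,
    }
```

With a helper like this, the rest of a handler can read `message["attributes"]["request_id"]` without caring whether the message arrived through a Lambda trigger or a boto3 poll loop.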
…ents

- Add ECSQueueRequestHandler class that bypasses ChatService and directly enqueues requests to SQS
- Implement sync mode (REST_SYNC) to wait for responses in DynamoDB Response Store
- Implement async mode (REST_ASYNC) to return request_id for polling
- Add GET /api/v1/chat/{session_id} endpoint for async response polling
- Update ECSRESTService to use queue-aware handler instead of default REST API
- Export ECSQueueRequestHandler from containerized module __init__.py
- Update example app_rest_service.py to demonstrate queue-based request handling
- Enables scalable ECS deployments with asynchronous agent execution and DynamoDB response storage
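
The difference between the two modes described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual ECSQueueRequestHandler: `InMemoryResponseStore`, `handle_chat_request`, the status strings, and the timeout parameters are all hypothetical, and an in-memory dict stands in for both SQS and the DynamoDB Response Store.

```python
import time
import uuid
from typing import Any, Callable, Dict, Optional


class InMemoryResponseStore:
    """Stand-in for the DynamoDB Response Store (assumption for this sketch)."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    def put(self, request_id: str, response: Any) -> None:
        self._items[request_id] = response

    def get(self, request_id: str) -> Optional[Any]:
        return self._items.get(request_id)


def handle_chat_request(
    prompt: str,
    session_id: str,
    mode: str,
    enqueue: Callable[[Dict[str, Any]], None],
    store: InMemoryResponseStore,
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.05,
) -> Dict[str, Any]:
    """Enqueue a request, then either return at once or wait for the result."""
    request_id = str(uuid.uuid4())
    enqueue({"request_id": request_id, "session_id": session_id, "prompt": prompt})

    if mode == "REST_ASYNC":
        # Async mode: hand back the request_id so the client can poll later.
        return {"status": "ACCEPTED", "request_id": request_id, "session_id": session_id}

    # Sync mode: poll the response store until the result appears or time runs out.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = store.get(request_id)
        if response is not None:
            return {"status": "COMPLETED", "request_id": request_id, "response": response}
        time.sleep(poll_interval_s)
    return {"status": "TIMEOUT", "request_id": request_id}
```

In this shape the REST service never runs the agent itself. It only writes to the queue and reads from the store, which is what lets the agent runners scale independently.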
…submission

- Change message_attributes dict parameter to request_id positional argument
- Align with SQS API expectations for FIFO queue message attributes
- Simplify queue message construction by using native request_id parameter
- Maintains backward compatibility with existing queue processing logic
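
One way to read this change is that the enqueue helper now takes `request_id` directly and builds the SQS attributes itself. The sketch below is an assumption about that shape, not the PR's code: `build_send_message_kwargs` is a hypothetical name, and using `session_id` as the FIFO message group and `request_id` as the deduplication ID are illustrative choices. The keyword names themselves (`QueueUrl`, `MessageBody`, `MessageGroupId`, `MessageDeduplicationId`, `MessageAttributes`) are the real parameters of boto3's SQS `send_message`.

```python
import json
from typing import Any, Dict


def build_send_message_kwargs(
    queue_url: str,
    request_id: str,
    session_id: str,
    payload: Dict[str, Any],
) -> Dict[str, Any]:
    """Build keyword arguments for sqs_client.send_message on a FIFO queue.

    Hypothetical helper; the grouping and deduplication choices are assumptions.
    """
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(payload),
        # FIFO queues require a message group; messages in one group stay ordered.
        "MessageGroupId": session_id,
        # A unique request_id doubles as the deduplication token.
        "MessageDeduplicationId": request_id,
        "MessageAttributes": {
            "request_id": {"DataType": "String", "StringValue": request_id},
        },
    }
```

A caller would pass the result straight through, for example `sqs_client.send_message(**build_send_message_kwargs(...))`.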
…ecycle

- Add detailed lifecycle logging with stage markers ([AGENT START], [AGENT PROCESSING], [AGENT RESPONSE], [AGENT DONE]) to akagentrunner.py for improved traceability
- Add structured logging for output processing pipeline ([OUTPUT START], [OUTPUT STORE], [OUTPUT DONE]) in akrestservice.py with request/session tracking
- Generate unique request_id using uuid.uuid4() instead of falling back to session_id in ecs_queue_handler.py for proper request isolation
- Add comprehensive logging at each stage of request lifecycle ([REQUEST START], [ENQUEUED], [WAITING], [RESPONSE FOUND], [WAIT START], [WAIT SUCCESS], [WAIT RETRY], [WAIT TIMEOUT]) in ecs_queue_handler.py
- Add debug_response_store.py utility script for inspecting DynamoDB response store state during development
- Include request_id, session_id, agent, and prompt preview in logs for better debugging and tracing across async/sync execution modes
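
The request isolation and stage-marker logging described above could look roughly like this. It is a minimal sketch: `new_request_id`, `log_stage`, the logger name, and the exact line format are assumptions, and only the stage marker names and the use of `uuid.uuid4()` come from the commit message.

```python
import logging
import uuid

logger = logging.getLogger("queue_lifecycle")  # logger name is an assumption


def new_request_id() -> str:
    """Generate a fresh ID per request instead of reusing the session_id."""
    return str(uuid.uuid4())


def log_stage(
    stage: str,
    request_id: str,
    session_id: str,
    agent: str,
    prompt: str,
    preview_len: int = 50,
) -> str:
    """Log one lifecycle stage with the IDs needed to trace a request."""
    preview = prompt[:preview_len] + ("..." if len(prompt) > preview_len else "")
    line = (
        f"[{stage}] request_id={request_id} session_id={session_id} "
        f"agent={agent} prompt={preview!r}"
    )
    logger.info(line)
    return line
```

Because each request in a session gets its own ID, filtering logs on one `request_id` follows a single request from `[REQUEST START]` through `[AGENT DONE]` even when several requests share a session.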
…GET integration

- Fix path parameter mapping from hardcoded sessionId to dynamic $request.path.sessionId
- Add blank line for improved readability in resource configuration
- Ensure API Gateway correctly forwards session ID from request path to backend service
…ant comments

- Rename sqs.tf to queue.tf for clearer module scope
- Remove SQS Queue Mode header comment (redundant with file purpose)
- Remove IAM section header comment (redundant with resource names)
- Remove CloudWatch Logs section comment (redundant with resource names)
- Improve code clarity by reducing comment clutter while maintaining resource documentation
…ple README

- Add comprehensive Queue Mode section to containerized module README with architecture diagram and configuration example
- Document scalable queue mode use cases and processing architecture (REST Service threads + Agent Runner)
- Add Queue Mode input variables section covering SQS visibility timeouts and Agent Runner configuration
- Restructure openai-dynamodb-scalable example README with improved architecture overview and deployed resources documentation
- Remove clean.sh and rebuild.sh scripts from scalable example (moved to root)
- Add .terraform.lock.hcl file to deploy directory for reproducible Terraform deployments
- Provides clear guidance for implementing high-throughput, asynchronously-processed agent workloads
- Convert single quotes to double quotes for string literals
- Consolidate multi-line string concatenations to single lines
- Remove trailing whitespace and normalize blank lines
- Simplify multi-line function calls and error messages
- Apply consistent formatting across ECSAgentRunner, ECSRESTService, ECS queue handler, and SQS poller modules
- Ensure consistent code style across containerized deployment infrastructure
- Add session_id parameter validation in ECS queue handler GET endpoint
- Implement security check to verify response session_id matches URL path session_id
- Add session_id validation in serverless Lambda REST_ASYNC polling operation
- Enhance logging to include session_id in poll operation for better traceability
- Return 403 Forbidden with detailed error when session_id mismatch is detected
- Prevent unauthorized access to responses belonging to different sessions
- Improve error messages to distinguish between missing and mismatched session IDs
…queue polling

- Add validation to require either request_id or session_id in GET polling requests
- Return 404 with detailed error message when no response is found instead of PENDING status
- Update error response format for session ID mismatch to use FORBIDDEN status code
- Include request_id and session_id in error response details for better debugging
- Enhance error messages with context about message unavailability and retry guidance
…ponses

- Change HTTP status from 403 (FORBIDDEN) to 404 (NOT_FOUND) in ECS queue handler when response message is not found
- Update error message to indicate message unavailability rather than session mismatch
- Add request_id to error detail in ECS queue handler response
- Align serverless Lambda router error response with queue handler for consistency
- Improve error messaging clarity for clients when async response messages cannot be located
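
Taken together, the three polling commits above suggest a validation flow along these lines. This sketch reflects one reading of the final behavior, in which a session mismatch is reported as 404 just like a missing response. `poll_response`, the dict-based store, the 400 status for missing identifiers, and the error payload fields are assumptions for illustration.

```python
from typing import Any, Dict, Optional, Tuple


def poll_response(
    store: Dict[str, Dict[str, Any]],
    session_id: Optional[str] = None,
    request_id: Optional[str] = None,
) -> Tuple[int, Dict[str, Any]]:
    """Return (http_status, body) for an async polling request.

    `store` maps request_id to {"session_id": ..., "response": ...}.
    """
    if not request_id and not session_id:
        # Status 400 is an assumption; the source only says one ID is required.
        return 400, {"error": "request_id or session_id is required"}

    if request_id:
        item = store.get(request_id)
    else:
        item = next((i for i in store.values() if i["session_id"] == session_id), None)

    # A response that belongs to another session is treated as unavailable,
    # so the caller cannot learn that the request_id exists.
    mismatch = (
        item is not None and session_id is not None and item["session_id"] != session_id
    )
    if item is None or mismatch:
        return 404, {
            "error": "Message not available yet or not found; retry later",
            "request_id": request_id,
            "session_id": session_id,
        }
    return 200, {
        "request_id": request_id,
        "session_id": item["session_id"],
        "response": item["response"],
    }
```

Returning the same status for "not found" and "wrong session" avoids confirming to a caller that a given request exists under a different session.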
- Add scaling.tf with Lambda-based BacklogPerTask metric calculation
- Implement EventBridge trigger for metric computation every minute
- Add target tracking scaling policy for Agent Runner ECS service
- Add autoscaling configuration variables to variables.tf
- Document autoscaling parameters and usage in README
- Enable configurable min/max task counts and scale in/out cooldown periods
- Add validation to require queue_mode when autoscaling is enabled
- Update example deployment to demonstrate autoscaling configuration
- Supports both sync and async queue modes with automatic task scaling based on queue depth
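
The BacklogPerTask metric named above is commonly computed as the number of visible queue messages divided by the number of running tasks. The sketch below shows only that arithmetic as pure functions. The function names, the zero-task guard, and the clamping logic are assumptions rather than the contents of scaling.tf or its Lambda, and target tracking performs its own internal calculation that `desired_task_count` only approximates.

```python
import math


def backlog_per_task(visible_messages: int, running_tasks: int) -> float:
    """Messages waiting per running task; treats zero tasks as one to avoid dividing by zero."""
    if visible_messages < 0 or running_tasks < 0:
        raise ValueError("counts must be non-negative")
    return visible_messages / max(running_tasks, 1)


def desired_task_count(
    visible_messages: int,
    target_backlog_per_task: float,
    min_tasks: int,
    max_tasks: int,
) -> int:
    """Approximate the task count that brings the backlog down to the target."""
    if target_backlog_per_task <= 0:
        raise ValueError("target_backlog_per_task must be positive")
    wanted = math.ceil(visible_messages / target_backlog_per_task)
    # Clamp to the configured minimum and maximum task counts.
    return max(min_tasks, min(max_tasks, wanted))
```

In a deployed Lambda, the inputs would typically come from the SQS `ApproximateNumberOfMessages` queue attribute and the ECS service's running task count, with the result published as a custom CloudWatch metric for the scaling policy to track.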

3 participants