Processing service V2 - Phase 1 #987
base: main
Conversation
Exciting!
Pull Request Overview
This PR introduces Processing Service V2, enabling a pull-based task queue architecture using NATS JetStream instead of the push-based Celery approach. Workers can now pull tasks via HTTP endpoints, process them independently, and acknowledge completion without maintaining persistent connections.
Key changes:
- Added NATS JetStream integration for distributed task queuing with configurable visibility timeouts
- Introduced new REST API endpoints for task pulling (/jobs/{id}/tasks) and result submission (/jobs/{id}/result); a worker-side sketch of this flow is shown below
- Implemented Redis-based progress tracking to handle asynchronous worker updates
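The endpoints above imply a simple polling loop on the worker side. Below is a minimal sketch of such a worker; the base URL, auth header, and payload field names are illustrative assumptions, not the exact schema introduced in this PR.

```python
import time

import requests

API_BASE = "https://antenna.example.org/api/v2"  # hypothetical base URL
HEADERS = {"Authorization": "Token <worker-token>"}  # hypothetical auth scheme


def run_worker(job_id: int, batch: int = 4) -> None:
    """Pull tasks for one job, process them, and post results back."""
    while True:
        # Reserve up to `batch` tasks; an empty response means the queue is currently drained.
        resp = requests.get(f"{API_BASE}/jobs/{job_id}/tasks", params={"batch": batch}, headers=HEADERS)
        resp.raise_for_status()
        tasks = resp.json()
        if not tasks:
            time.sleep(5)
            continue
        for task in tasks:
            result = process_image(task["image_url"])  # worker-side ML pipeline call
            payload = {
                "reply_subject": task["reply_subject"],  # used later server-side to ACK the NATS message
                "result": result,
            }
            # Submitting a result enqueues a Celery task server-side; the worker never talks to NATS directly.
            requests.post(f"{API_BASE}/jobs/{job_id}/result", json=payload, headers=HEADERS).raise_for_status()


def process_image(url: str) -> dict:
    """Placeholder for the worker's ML pipeline; returns a result-shaped dict."""
    raise NotImplementedError
```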
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| requirements/base.txt | Added nats-py dependency for NATS client support |
| object_model_diagram.md | Added comprehensive Mermaid diagram documenting ML pipeline system architecture |
| docker-compose.yml | Added NATS JetStream service with health checks and monitoring |
| config/settings/base.py | Added NATS_URL configuration setting |
| ami/utils/nats_queue.py | New TaskQueueManager class for NATS JetStream operations |
| ami/jobs/views.py | Added task pulling and result submission endpoints with pipeline filtering |
| ami/jobs/utils.py | Helper function for running async code in sync Django context |
| ami/jobs/tasks.py | New Celery task for processing pipeline results asynchronously |
| ami/jobs/task_state.py | TaskStateManager for Redis-based job progress tracking |
| ami/jobs/models.py | Added queue_images_to_nats method and NATS cleanup logic |
| ami/base/views.py | Fixed request.data handling when not a dict |
| README.md | Added NATS dashboard documentation link |
| .vscode/launch.json | Added debug configurations for Django and Celery containers |
| .envs/.local/.django | Added NATS_URL environment variable |
| .dockerignore | Expanded with comprehensive ignore patterns |
Comments suppressed due to low confidence (1)
object_model_diagram.md:1
- The comment at line 13 appears to be template text from instructions rather than actual documentation content. This namedtuple field description doesn't match the file's purpose as an object model diagram.
# Object Model Diagram: ML Pipeline System
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds NATS JetStream task queuing, a TaskQueueManager, Redis-backed per-job task-state tracking, async ML job queuing behind a feature flag, Celery result-processing with NATS ACKing, API endpoints to reserve/submit tasks, tests, Docker/CI and config updates, and a new nats dependency.
```mermaid
sequenceDiagram
participant Client
participant MLJob
participant Flags as FeatureFlags
participant QueueMgr as TaskQueueManager
participant State as TaskStateManager
participant Worker
participant Celery as process_pipeline_result
participant DB as Database
Client->>MLJob: run(job, images)
MLJob->>Flags: check async_pipeline_workers
alt async enabled
MLJob->>State: initialize_job(image_ids)
MLJob->>QueueMgr: queue_images_to_nats(job, images)
loop per batch
QueueMgr->>QueueMgr: publish_task(job_id, message)
end
else sync path
MLJob->>MLJob: process_images(job, images)
MLJob->>DB: persist results & progress
end
Note over Worker,QueueMgr: Worker reserves tasks from JetStream
Worker->>QueueMgr: reserve_task(job_id, batch)
Worker->>Celery: run pipeline, produce result + reply_subject
Celery->>State: update_state(processed_ids, "process", request_id)
Celery->>DB: save pipeline results
Celery->>QueueMgr: acknowledge_task(reply_subject)
Celery->>State: update_state(processed_ids, "results", request_id)
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45–75 minutes
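The walkthrough and diagram above reference a Redis-backed TaskStateManager (initialize_job / update_state / progress). A stripped-down sketch of that idea follows; the key names, stage names, and use of plain Redis sets are assumptions for illustration — the real class in this PR also handles locking and cleanup.

```python
import redis


class SimpleTaskState:
    """Minimal Redis-backed per-job progress tracker (illustrative only)."""

    def __init__(self, job_id: int, client: redis.Redis | None = None):
        self.job_id = job_id
        self.r = client or redis.Redis()

    def _key(self, name: str) -> str:
        return f"job:{self.job_id}:{name}"

    def initialize_job(self, image_ids: list[str]) -> None:
        # Remember the full set of images the job must process.
        self.r.delete(self._key("pending"), self._key("process"), self._key("results"))
        if image_ids:
            self.r.sadd(self._key("pending"), *image_ids)

    def update_state(self, image_ids: set[str], stage: str) -> float:
        # Mark images done for a stage ("process" or "results") and return that stage's progress.
        if image_ids:
            self.r.sadd(self._key(stage), *image_ids)
        total = self.r.scard(self._key("pending")) or 1  # avoid division by zero for empty jobs
        done = self.r.scard(self._key(stage))
        return min(done / total, 1.0)

    def cleanup(self) -> None:
        self.r.delete(self._key("pending"), self._key("process"), self._key("results"))
```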
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- .dockerignore (1 hunk)
- .envs/.local/.django (1 hunk)
- .gitignore (1 hunk)
- .vscode/launch.json (1 hunk)
- README.md (1 hunk)
- ami/base/views.py (1 hunk)
- ami/jobs/models.py (8 hunks)
- ami/jobs/tasks.py (2 hunks)
- ami/jobs/views.py (3 hunks)
- ami/main/models.py (1 hunk)
- ami/ml/orchestration/jobs.py (1 hunk)
- ami/ml/orchestration/nats_queue.py (1 hunk)
- ami/ml/orchestration/task_state.py (1 hunk)
- ami/ml/orchestration/utils.py (1 hunk)
- ami/utils/requests.py (2 hunks)
- config/settings/base.py (2 hunks)
- docker-compose.yml (4 hunks)
- object_model_diagram.md (1 hunk)
- requirements/base.txt (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
ami/ml/orchestration/nats_queue.py (1)
ami/jobs/views.py (1)
result(256-339)
ami/ml/orchestration/task_state.py (1)
ami/ml/orchestration/jobs.py (1)
cleanup(20-23)
ami/jobs/views.py (3)
ami/jobs/tasks.py (1)
process_pipeline_result(45-138)ami/jobs/models.py (4)
Job(727-1012)JobState(27-63)logger(997-1006)final_states(58-59)ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(28-294)reserve_task(152-208)
ami/jobs/tasks.py (5)
ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(28-294)acknowledge_task(210-229)ami/ml/orchestration/task_state.py (3)
TaskStateManager(17-97)mark_images_processed(48-61)get_progress(63-90)ami/ml/orchestration/utils.py (1)
run_in_async_loop(8-18)ami/jobs/models.py (5)
Job(727-1012)JobState(27-63)logger(997-1006)update_stage(168-188)save(947-958)ami/ml/models/pipeline.py (3)
save(1115-1121)save_results(809-917)save_results(1107-1108)
ami/ml/orchestration/jobs.py (4)
ami/jobs/models.py (2)
Job(727-1012)logger(997-1006)ami/ml/orchestration/nats_queue.py (3)
TaskQueueManager(28-294)cleanup_job_resources(278-294)publish_task(119-150)ami/ml/orchestration/task_state.py (3)
TaskStateManager(17-97)cleanup(92-97)initialize_job(38-46)ami/ml/orchestration/utils.py (1)
run_in_async_loop(8-18)
ami/ml/orchestration/utils.py (1)
ami/jobs/models.py (1)
logger(997-1006)
ami/base/views.py (1)
ami/main/api/views.py (1)
get(1595-1651)
ami/jobs/models.py (3)
ami/ml/orchestration/jobs.py (1)
queue_images_to_nats(28-107)ami/main/models.py (1)
SourceImage(1622-1870)ami/ml/models/pipeline.py (2)
process_images(163-278)process_images(1091-1105)
🪛 LanguageTool
object_model_diagram.md
[grammar] ~167-~167: Ensure spelling is correct
Context: ...ts 4. Job tracks progress through JobProgress and JobProgressStageDetail
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
object_model_diagram.md
15-15: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
38-38: Bare URL used
(MD034, no-bare-urls)
39-39: Bare URL used
(MD034, no-bare-urls)
40-40: Bare URL used
(MD034, no-bare-urls)
41-41: Bare URL used
(MD034, no-bare-urls)
42-42: Bare URL used
(MD034, no-bare-urls)
42-42: Bare URL used
(MD034, no-bare-urls)
43-43: Bare URL used
(MD034, no-bare-urls)
61-61: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
77-77: Bare URL used
(MD034, no-bare-urls)
97-97: Bare URL used
(MD034, no-bare-urls)
118-118: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
122-122: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
126-126: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
130-130: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
🪛 Ruff (0.14.2)
ami/ml/orchestration/nats_queue.py
70-70: Unused method argument: ttr
(ARG002)
73-73: Avoid specifying long messages outside the exception class
(TRY003)
81-81: Do not catch blind exception: Exception
(BLE001)
94-94: Avoid specifying long messages outside the exception class
(TRY003)
103-103: Do not catch blind exception: Exception
(BLE001)
132-132: Avoid specifying long messages outside the exception class
(TRY003)
146-146: Consider moving this statement to an else block
(TRY300)
148-148: Do not catch blind exception: Exception
(BLE001)
149-149: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
164-164: Avoid specifying long messages outside the exception class
(TRY003)
206-206: Do not catch blind exception: Exception
(BLE001)
207-207: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
226-226: Consider moving this statement to an else block
(TRY300)
227-227: Do not catch blind exception: Exception
(BLE001)
228-228: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
242-242: Avoid specifying long messages outside the exception class
(TRY003)
250-250: Consider moving this statement to an else block
(TRY300)
251-251: Do not catch blind exception: Exception
(BLE001)
252-252: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
266-266: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Consider moving this statement to an else block
(TRY300)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/ml/orchestration/task_state.py
35-35: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
36-36: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
ami/jobs/views.py
33-43: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
73-73: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
218-218: Unused method argument: pk
(ARG002)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
244-244: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
283-283: Avoid specifying long messages outside the exception class
(TRY003)
288-288: Avoid specifying long messages outside the exception class
(TRY003)
298-298: Avoid specifying long messages outside the exception class
(TRY003)
301-301: Avoid specifying long messages outside the exception class
(TRY003)
322-322: Do not catch blind exception: Exception
(BLE001)
323-323: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/jobs/tasks.py
45-45: Unused function argument: self
(ARG001)
120-120: Do not catch blind exception: Exception
(BLE001)
121-121: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
133-133: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
136-136: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
config/settings/base.py
28-28: Unused noqa directive (non-enabled: E231, E501)
Remove unused noqa directive
(RUF100)
ami/ml/orchestration/jobs.py
69-69: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
78-78: Do not catch blind exception: Exception
(BLE001)
ami/ml/orchestration/utils.py
14-14: Do not catch blind exception: Exception
(BLE001)
15-15: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/jobs/models.py
75-75: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
430-430: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
482-482: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
Actionable comments posted: 0
♻️ Duplicate comments (2)
ami/jobs/tasks.py (1)
157-160: Use timezone-aware datetime for `finished_at`.

`datetime.now()` returns a naive datetime. Django applications should use timezone-aware timestamps.

```diff
+from django.utils import timezone
+
 def _update_job_progress(job_id: int, stage: str, progress_percentage: float) -> None:
     from ami.jobs.models import Job, JobState  # avoid circular import

     with transaction.atomic():
         job = Job.objects.select_for_update().get(pk=job_id)
         job.progress.update_stage(
             stage,
             status=JobState.SUCCESS if progress_percentage >= 1.0 else JobState.STARTED,
             progress=progress_percentage,
         )
         if stage == "results" and progress_percentage >= 1.0:
             job.status = JobState.SUCCESS
             job.progress.summary.status = JobState.SUCCESS
-            job.finished_at = datetime.now()
+            job.finished_at = timezone.now()
         job.logger.info(f"Updated job {job_id} progress in stage '{stage}' to {progress_percentage*100}%")
         job.save()
```

ami/ml/orchestration/jobs.py (1)

89-92: Update both stages when the image list is empty.

When `images` is empty, only the "results" stage is marked as SUCCESS. The "process" stage should also be updated to maintain consistent job state. Apply this diff:

```diff
 if not images:
+    job.progress.update_stage("process", status=JobState.SUCCESS, progress=1.0)
     job.progress.update_stage("results", status=JobState.SUCCESS, progress=1.0)
     job.save()
```
🧹 Nitpick comments (5)
ami/jobs/views.py (2)
241-248: Fix unused loop variable and consider a longer timeout.
- The loop variable
iis unused; rename to_.- The
timeout=0.1(100ms) is quite short for network operations. Consider a slightly longer timeout or making it configurable.async def get_tasks(): tasks = [] async with TaskQueueManager() as manager: - for i in range(batch): - task = await manager.reserve_task(job_id, timeout=0.1) + for _ in range(batch): + task = await manager.reserve_task(job_id, timeout=0.5) if task: tasks.append(task) return tasks
312-319: Uselogger.exceptionto capture the stack trace.When catching a broad exception, use
logger.exception()instead oflogger.error()to include the full traceback for debugging.except Exception as e: - logger.error(f"Failed to queue pipeline results for job {job_id}: {e}") + logger.exception(f"Failed to queue pipeline results for job {job_id}: {e}") return HttpResponseServerError( { "status": "error", "job_id": job_id, }, )ami/jobs/tasks.py (3)
100-103: Avoid usingassertfor runtime validation in production code.Assertions can be disabled with the
-Oflag, which would skip this check entirely. For runtime validation, use an explicit conditional with an appropriate exception.if pipeline_result: - # should never happen since otherwise we could not be processing results here - assert job.pipeline is not None, "Job pipeline is None" + if job.pipeline is None: + raise ValueError(f"Job {job_id} has no pipeline configured") job.pipeline.save_results(results=pipeline_result, job_id=job.pk)
123-125: Uselogger.exceptionto capture the full stack trace.When logging exceptions,
logger.exception()automatically includes the traceback, which aids debugging.except Exception as ack_error: - job.logger.error(f"Error acknowledging task via NATS: {ack_error}") + job.logger.exception(f"Error acknowledging task via NATS: {ack_error}") # Don't fail the task if ACK fails - data is already saved
138-144: Uselogger.exceptionfor error logging with stack traces.Both exception handlers should use
logger.exception()to capture full tracebacks.except Job.DoesNotExist: - logger.error(f"Job {job_id} not found") + logger.exception(f"Job {job_id} not found") raise except Exception as e: - logger.error(f"Failed to process pipeline result for job {job_id}: {e}") + logger.exception(f"Failed to process pipeline result for job {job_id}: {e}") # Celery will automatically retry based on autoretry_for raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- ami/jobs/tasks.py (3 hunks)
- ami/jobs/views.py (5 hunks)
- ami/ml/orchestration/jobs.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (3)
ami/jobs/tasks.py (4)
ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(31-296)acknowledge_task(212-231)ami/ml/orchestration/task_state.py (2)
TaskStateManager(17-125)update_state(54-83)ami/jobs/models.py (3)
Job(734-1019)JobState(27-63)save(954-965)ami/ml/models/pipeline.py (3)
save_results(814-922)save_results(1120-1121)save(1128-1134)
ami/jobs/views.py (3)
ami/jobs/tasks.py (1)
process_pipeline_result(49-144)ami/ml/schemas.py (1)
PipelineTaskResult(232-238)ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(31-296)reserve_task(154-210)
ami/ml/orchestration/jobs.py (3)
ami/jobs/models.py (3)
Job(734-1019)JobState(27-63)logger(1004-1013)ami/ml/orchestration/nats_queue.py (3)
TaskQueueManager(31-296)cleanup_job_resources(280-296)publish_task(122-152)ami/ml/orchestration/task_state.py (3)
TaskStateManager(17-125)cleanup(119-125)initialize_job(39-49)
🪛 Ruff (0.14.8)
ami/jobs/tasks.py
123-123: Do not catch blind exception: Exception
(BLE001)
124-124: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
135-135: Abstract raise to an inner function
(TRY301)
139-139: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
142-142: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/jobs/views.py
244-244: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
256-256: Unused method argument: pk
(ARG002)
310-310: Avoid specifying long messages outside the exception class
(TRY003)
312-312: Do not catch blind exception: Exception
(BLE001)
313-313: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/ml/orchestration/jobs.py
76-76: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (8)
ami/ml/orchestration/jobs.py (2)
76-78: Catching broadExceptionis acceptable here for fault tolerance.In an async context interacting with external NATS infrastructure, catching a broad exception to log and continue processing remaining images is reasonable. The success/failure is tracked per-image via
failed_queues.
28-101: Implementation looks solid overall.The function correctly:
- Prepares messages outside async context to avoid Django ORM issues
- Initializes Redis state tracking before queueing
- Tracks successes/failures per image
- Uses timezone-aware timestamps
ami/jobs/views.py (3)
34-49: JobFilterSet implementation looks good.Clean filter implementation with proper field configuration.
274-308: Result endpoint implementation is well-structured.The endpoint correctly:
- Handles both list and single-item input
- Validates with Pydantic
- Queues tasks asynchronously via Celery
- Returns proper response with task tracking info
228-228: Verifyjob_idtype consistency with TaskQueueManager.
job.pkis an integer, but it's passed directly tomanager.reserve_task(job_id, ...). Looking atTaskQueueManager, it expectsjob_id: strfor methods likereserve_task. This works due to Python's dynamic typing, but the inconsistency could cause issues.Consider converting explicitly:
- job_id = job.pk + job_id = str(job.pk)#!/bin/bash # Verify TaskQueueManager method signatures expect string job_id ast-grep --pattern 'async def reserve_task(self, job_id: str$_)' rg -n "def reserve_task" --type=py -A2ami/jobs/tasks.py (3)
41-48: Task configuration looks appropriate.Good choices for retry and timeout configuration:
max_retries=3with 60s delay provides reasonable retry behavior- 5-minute soft limit with 6-minute hard limit gives buffer for cleanup
autoretry_for=(Exception,)ensures transient failures are retried
203-225: Clever timing helper implementation.The
log_timehelper usingfunctools.partialfor chained timing is elegant and reusable.
147-163: Job completion logic is well-structured.The atomic transaction with
select_for_updateproperly handles concurrent updates. The completion detection based on the "results" stage reaching 100% is appropriate.
Actionable comments posted: 0
♻️ Duplicate comments (1)
ami/ml/orchestration/nats_queue.py (1)
104-121: TTR mismatch risk when consumer already exists.

If a consumer was previously created with a different `ack_wait` value, this code silently reuses it without verifying the configuration matches `TASK_TTR`. JetStream doesn't allow mutating `ack_wait` on an existing durable consumer, which could lead to unexpected redelivery behavior if the TTR expectation changes.

Consider detecting a mismatched `ack_wait` and recreating the consumer, or at minimum logging a warning when the existing consumer's configuration differs.
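One possible remedy, sketched with the nats-py JetStream API: inspect the existing durable consumer and recreate it when its ack_wait no longer matches the requested TTR. The stream/consumer names below assume this PR's job_<id> / job-<id>-consumer convention, and the exact error class may differ from the real code.

```python
from nats.js import JetStreamContext
from nats.js.api import ConsumerConfig
from nats.js.errors import NotFoundError


async def ensure_consumer(js: JetStreamContext, job_id: str, ttr_seconds: float) -> None:
    """Create the durable consumer, recreating it if its ack_wait no longer matches the TTR."""
    stream = f"job_{job_id}"
    durable = f"job-{job_id}-consumer"
    config = ConsumerConfig(durable_name=durable, ack_wait=ttr_seconds)
    try:
        info = await js.consumer_info(stream, durable)
    except NotFoundError:
        await js.add_consumer(stream, config=config)
        return
    # JetStream won't mutate ack_wait on an existing durable consumer,
    # so drop and recreate it when the configured value differs.
    if info.config.ack_wait != ttr_seconds:
        await js.delete_consumer(stream, durable)
        await js.add_consumer(stream, config=config)
```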
🧹 Nitpick comments (9)
ami/jobs/tasks.py (2)
111-125: Uselogger.exceptionfor better error diagnostics.When catching exceptions,
logger.exceptionautomatically includes the stack trace, which aids debugging. This aligns with the Ruff TRY400 hint.except Exception as ack_error: - job.logger.error(f"Error acknowledging task via NATS: {ack_error}") + job.logger.exception(f"Error acknowledging task via NATS: {ack_error}") # Don't fail the task if ACK fails - data is already saved
138-144: Uselogger.exceptionfor better error diagnostics.Similar to the NATS ack error handling, using
logger.exceptionhere will include the full stack trace for debugging.except Job.DoesNotExist: - logger.error(f"Job {job_id} not found") + logger.exception(f"Job {job_id} not found") raise except Exception as e: - logger.error(f"Failed to process pipeline result for job {job_id}: {e}") + logger.exception(f"Failed to process pipeline result for job {job_id}: {e}") # Celery will automatically retry based on autoretry_for raiseami/ml/orchestration/jobs.py (3)
17-18: Consider adding return type to docstring.The function signature shows
-> boolbut the docstring doesn't document the return value.""" Clean up NATS JetStream resources (stream and consumer) for a completed job. Args: job: The Job instance + + Returns: + bool: True if cleanup was successful, False otherwise """
76-78: Uselogger.exceptionfor better error diagnostics.When catching exceptions in the queue loop, using
logger.exceptionwill include the stack trace.except Exception as e: - logger.error(f"Failed to queue image {image_pk} to stream for job '{job_id}': {e}") + logger.exception(f"Failed to queue image {image_pk} to stream for job '{job_id}': {e}") success = False
87-92: Empty images check occurs after queueing attempt.The
if not imagescheck on line 89 happens afterasync_to_sync(queue_all_images)()has already executed. Ifimageswas empty,taskswould also be empty, and the async function would return(0, 0). This works but the check should ideally be earlier to avoid unnecessary async overhead.+ if not images: + job.progress.update_stage("process", status=JobState.SUCCESS, progress=1.0) + job.progress.update_stage("results", status=JobState.SUCCESS, progress=1.0) + job.save() + job.logger.info(f"No images to queue for job '{job_id}'") + return True + # Prepare all messages outside of async context to avoid Django ORM issues tasks: list[tuple[int, PipelineProcessingTask]] = [] ... successful_queues, failed_queues = async_to_sync(queue_all_images)() - if not images: - job.progress.update_stage("process", status=JobState.SUCCESS, progress=1.0) - job.progress.update_stage("results", status=JobState.SUCCESS, progress=1.0) - job.save()ami/jobs/views.py (3)
228-228: Consider passingjob_idas string consistently.The
job_idis assigned asjob.pk(likely an integer), butTaskQueueManager.reserve_taskexpects a stringjob_id. While Python will handle the conversion in f-strings internally, it's cleaner to be explicit.- job_id = job.pk + job_id = str(job.pk)
241-251: Rename unused loop variable and consider batch fetching optimization.The loop variable
iis unused (Ruff B007). Also, callingreserve_tasksequentially in a loop with short timeouts may be inefficient. Consider fetching multiple messages in a single pull if the NATS client supports it.async def get_tasks(): tasks = [] async with TaskQueueManager() as manager: - for i in range(batch): + for _ in range(batch): task = await manager.reserve_task(job_id, timeout=0.1) if task: tasks.append(task.dict()) return tasks
312-319: Uselogger.exceptionand avoid returning 500 directly.Using
logger.exceptionincludes the stack trace. Also, consider using DRF's proper exception handling rather than returningstatus=500directly.except Exception as e: - logger.error(f"Failed to queue pipeline results for job {job_id}: {e}") + logger.exception(f"Failed to queue pipeline results for job {job_id}: {e}") return Response( { "status": "error", "job_id": job_id, + "message": str(e), }, status=500, )ami/ml/orchestration/nats_queue.py (1)
144-144: Replacedict()withmodel_dump()for Pydantic v2 compatibility.In Pydantic v2,
dict()is deprecated in favor ofmodel_dump(). Update line 144 fromjson.dumps(data.dict())tojson.dumps(data.model_dump()).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- ami/jobs/tasks.py (3 hunks)
- ami/jobs/views.py (5 hunks)
- ami/ml/orchestration/jobs.py (1 hunk)
- ami/ml/orchestration/nats_queue.py (1 hunk)
- ami/ml/schemas.py (0 hunks)
💤 Files with no reviewable changes (1)
- ami/ml/schemas.py
🧰 Additional context used
🧬 Code graph analysis (3)
ami/jobs/views.py (3)
ami/jobs/tasks.py (1)
process_pipeline_result(49-144)ami/ml/schemas.py (1)
PipelineTaskResult(231-237)ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(32-297)reserve_task(156-211)
ami/ml/orchestration/nats_queue.py (1)
ami/ml/schemas.py (1)
PipelineProcessingTask(217-228)
ami/jobs/tasks.py (3)
ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(32-297)acknowledge_task(213-232)ami/ml/orchestration/task_state.py (2)
TaskStateManager(17-125)update_state(54-83)ami/ml/schemas.py (1)
PipelineResultsResponse(199-214)
🪛 Ruff (0.14.8)
ami/jobs/views.py
244-244: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
256-256: Unused method argument: pk
(ARG002)
310-310: Avoid specifying long messages outside the exception class
(TRY003)
312-312: Do not catch blind exception: Exception
(BLE001)
313-313: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/ml/orchestration/nats_queue.py
77-77: Avoid specifying long messages outside the exception class
(TRY003)
85-85: Do not catch blind exception: Exception
(BLE001)
98-98: Avoid specifying long messages outside the exception class
(TRY003)
107-107: Do not catch blind exception: Exception
(BLE001)
135-135: Avoid specifying long messages outside the exception class
(TRY003)
150-150: Consider moving this statement to an else block
(TRY300)
152-152: Do not catch blind exception: Exception
(BLE001)
153-153: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
168-168: Avoid specifying long messages outside the exception class
(TRY003)
209-209: Do not catch blind exception: Exception
(BLE001)
210-210: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
229-229: Consider moving this statement to an else block
(TRY300)
230-230: Do not catch blind exception: Exception
(BLE001)
231-231: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
245-245: Avoid specifying long messages outside the exception class
(TRY003)
253-253: Consider moving this statement to an else block
(TRY300)
254-254: Do not catch blind exception: Exception
(BLE001)
255-255: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
269-269: Avoid specifying long messages outside the exception class
(TRY003)
276-276: Consider moving this statement to an else block
(TRY300)
277-277: Do not catch blind exception: Exception
(BLE001)
278-278: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/jobs/tasks.py
123-123: Do not catch blind exception: Exception
(BLE001)
124-124: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
135-135: Abstract raise to an inner function
(TRY301)
139-139: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
142-142: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/ml/orchestration/jobs.py
76-76: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (17)
ami/jobs/tasks.py (5)
1-17: LGTM!Import organization is clean. Good use of
timezonefrom Django utils (addressing prior feedback) and proper separation of concerns with the new orchestration imports.
41-63: LGTM!Task configuration is well-structured with appropriate retry settings. The 5/6 minute soft/hard time limits align with the lock timeout in
TaskStateManager(360 seconds). Docstring is now accurate (addresses prior feedback about return type).
72-75: LGTM!Good fix for marking failed images as processed to prevent job hanging (addresses prior feedback). The conditional
{str(image_id)} if image_id else set()properly handles cases whereimage_idmight be missing.
147-162: LGTM!Good use of
transaction.atomic()withselect_for_update()for safe concurrent updates. The job completion logic (settingfinished_atwithtimezone.now()) properly addresses prior feedback. The conditional status transition whenstage == "results"andprogress_percentage >= 1.0correctly handles job lifecycle termination.
203-225: LGTM!Clever timing helper using
functools.partialfor chained measurements. The pattern of returning both the duration and a callable for the next measurement is elegant and readable.ami/ml/orchestration/nats_queue.py (6)
1-27: LGTM!Clean module structure with good documentation. The
get_connectionhelper properly establishes both NATS and JetStream contexts.
43-60: LGTM!Proper async context manager implementation. The
__aexit__correctly handles cleanup by nullifying references and closing the connection only if not already closed.
123-154: LGTM!The
publish_taskmethod correctly ensures stream and consumer exist before publishing. Usingdata.dict()for Pydantic serialization is appropriate.
156-211: LGTM! The `reserve_task` implementation is well-structured:
- Properly creates an ephemeral subscription for the pull
- Handles timeouts gracefully with `nats.errors.TimeoutError`
- Always unsubscribes in the `finally` block
- Returns the parsed `PipelineProcessingTask` with `reply_subject` attached

A minimal sketch of this pull-reservation pattern is shown below.
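For reference, a rough sketch of that pull-reservation flow with nats-py, assuming the `job.<id>.tasks` subject and `job-<id>-consumer` durable names implied elsewhere in this review; the connection URL and timeout are illustrative only.

```python
import nats
from nats.errors import TimeoutError as NATSTimeoutError


async def reserve_one(job_id: str, timeout: float = 0.5):
    """Reserve a single task for a job, returning (payload, reply_subject) or None."""
    nc = await nats.connect("nats://localhost:4222")  # assumed NATS_URL
    js = nc.jetstream()
    psub = await js.pull_subscribe(f"job.{job_id}.tasks", durable=f"job-{job_id}-consumer")
    try:
        msgs = await psub.fetch(1, timeout=timeout)
    except NATSTimeoutError:
        return None  # queue is empty right now
    finally:
        await psub.unsubscribe()
        await nc.close()
    msg = msgs[0]
    # msg.reply is the per-delivery reply subject that can be used to ACK later.
    return msg.data, msg.reply
```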
213-232: LGTM! Simple and effective acknowledgment via the reply subject. The `+ACK` protocol message is correct for JetStream acknowledgment.
281-297: LGTM!The
cleanup_job_resourcesmethod correctly deletes consumer before stream (proper order to avoid orphaned resources).ami/ml/orchestration/jobs.py (3)
10-25: LGTM!Good implementation. The function now correctly returns the cleanup result (addressing prior feedback), allowing callers to handle success/failure.
42-57: LGTM!Good defensive coding:
- Validates image URL before queuing
- Logs warning for skipped images
- Properly builds
PipelineProcessingTaskobjects outside async context to avoid Django ORM issues
89-92: LGTM!Good fix addressing prior feedback - both "process" and "results" stages are now updated when the images list is empty, ensuring consistent job state.
ami/jobs/views.py (3)
33-48: LGTM!Clean filterset implementation enabling pipeline filtering by slug.
255-272: LGTM!Good handling of both single result and list of results. The endpoint correctly validates and normalizes input.
274-299: LGTM!Good implementation of result queuing:
- Each result is validated with Pydantic
- Background task is properly dispatched via Celery
.delay()- Response includes task IDs for tracking
- Logging provides good observability
Actionable comments posted: 1
♻️ Duplicate comments (1)
ami/ml/orchestration/nats_queue.py (1)
104-121: TTR mismatch when consumer already exists remains unaddressed.

Per the past review comment, if `reserve_task()` creates the consumer before `publish_task()`, the consumer uses the default `TASK_TTR`. If different TTR values are needed later, the existing consumer configuration won't be updated, since JetStream doesn't allow mutating `ack_wait` on an existing durable consumer. This can cause premature redeliveries.

The suggested fix from the past review was to recreate the consumer when the configured `ack_wait` differs from the requested TTR. Consider implementing that logic or documenting that a single TTR value is always used.
🧹 Nitpick comments (5)
ami/ml/orchestration/jobs.py (2)
44-50: Avoid duplicateurl()call.
image.url()is called twice: once in the condition check and again for assignment. If this method performs any computation or I/O, this is wasteful.for image in images: image_id = str(image.pk) - image_url = image.url() if hasattr(image, "url") and image.url() else "" + image_url = image.url() if hasattr(image, "url") else "" if not image_url: job.logger.warning(f"Image {image.pk} has no URL, skipping queuing to NATS for job '{job.pk}'") continue
68-77: Inconsistent logger usage inside async context.Lines 70 and 76 use the module-level
logger, while the rest of the function usesjob.logger. This inconsistency can make log correlation difficult when debugging job-specific issues.for image_pk, task in tasks: try: - logger.info(f"Queueing image {image_pk} to stream for job '{job.pk}': {task.image_url}") + job.logger.info(f"Queueing image {image_pk} to stream for job '{job.pk}': {task.image_url}") success = await manager.publish_task( job_id=job.pk, data=task, ) except Exception as e: - logger.error(f"Failed to queue image {image_pk} to stream for job '{job.pk}': {e}") + job.logger.error(f"Failed to queue image {image_pk} to stream for job '{job.pk}': {e}") success = Falseami/jobs/views.py (2)
299-306: Simplify redundant list comprehension.All items in
queued_tasksalready havestatus: "queued"(set on line 289), making the filter unnecessary.return Response( { "status": "accepted", "job_id": job.pk, - "results_queued": len([t for t in queued_tasks if t["status"] == "queued"]), + "results_queued": len(queued_tasks), "tasks": queued_tasks, } )
310-317: Uselogger.exception()to capture traceback.
logger.exception()automatically includes the traceback, which aids debugging when errors occur in production.except Exception as e: - logger.error(f"Failed to queue pipeline results for job {job.pk}: {e}") + logger.exception(f"Failed to queue pipeline results for job {job.pk}: {e}") return Response( { "status": "error", "job_id": job.pk, }, status=500, )ami/ml/orchestration/nats_queue.py (1)
143-144: Pydantic v1 syntax used.
.dict()is Pydantic v1 syntax. For Pydantic v2 compatibility, use.model_dump(). This applies throughout the codebase where Pydantic models are serialized.# Convert Pydantic model to JSON - task_data = json.dumps(data.dict()) + task_data = json.dumps(data.model_dump())
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- ami/jobs/tests.py (3 hunks)
- ami/jobs/views.py (3 hunks)
- ami/ml/orchestration/jobs.py (1 hunk)
- ami/ml/orchestration/nats_queue.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (2)
ami/ml/orchestration/nats_queue.py (1)
ami/ml/schemas.py (1)
PipelineProcessingTask(217-228)
ami/jobs/views.py (3)
ami/jobs/tasks.py (1)
process_pipeline_result(49-144)ami/ml/schemas.py (1)
PipelineTaskResult(231-237)ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(32-297)reserve_task(156-211)
🪛 Ruff (0.14.8)
ami/ml/orchestration/nats_queue.py
77-77: Avoid specifying long messages outside the exception class
(TRY003)
85-85: Do not catch blind exception: Exception
(BLE001)
98-98: Avoid specifying long messages outside the exception class
(TRY003)
107-107: Do not catch blind exception: Exception
(BLE001)
135-135: Avoid specifying long messages outside the exception class
(TRY003)
150-150: Consider moving this statement to an else block
(TRY300)
152-152: Do not catch blind exception: Exception
(BLE001)
153-153: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
168-168: Avoid specifying long messages outside the exception class
(TRY003)
209-209: Do not catch blind exception: Exception
(BLE001)
210-210: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
229-229: Consider moving this statement to an else block
(TRY300)
230-230: Do not catch blind exception: Exception
(BLE001)
231-231: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
245-245: Avoid specifying long messages outside the exception class
(TRY003)
253-253: Consider moving this statement to an else block
(TRY300)
254-254: Do not catch blind exception: Exception
(BLE001)
255-255: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
269-269: Avoid specifying long messages outside the exception class
(TRY003)
276-276: Consider moving this statement to an else block
(TRY300)
277-277: Do not catch blind exception: Exception
(BLE001)
278-278: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/ml/orchestration/jobs.py
75-75: Do not catch blind exception: Exception
(BLE001)
ami/jobs/views.py
255-255: Unused method argument: pk
(ARG002)
308-308: Avoid specifying long messages outside the exception class
(TRY003)
310-310: Do not catch blind exception: Exception
(BLE001)
311-311: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (7)
ami/jobs/tests.py (1)
397-399: LGTM!The updated assertions correctly reflect the new response structure from the result endpoint, matching the implementation in
views.py.ami/ml/orchestration/jobs.py (2)
11-25: LGTM!The cleanup function properly returns the boolean result from the async cleanup operation, addressing the earlier feedback.
86-102: LGTM!The post-queueing logic correctly handles both the empty tasks case (updating both stages) and properly reports success/failure based on queue outcomes.
ami/ml/orchestration/nats_queue.py (3)
181-207: LGTM!The task reservation logic correctly creates a pull subscription, handles timeouts gracefully, and ensures cleanup via
finallyblock.
213-232: LGTM!The acknowledgment implementation correctly publishes the ACK message to the reply subject.
281-297: LGTM!The cleanup logic correctly deletes the consumer before the stream and returns a combined success indicator.
ami/jobs/views.py (1)
237-252: LGTM with note on Pydantic version compatibility.The NATS task retrieval implementation looks correct. If this project uses or will upgrade to Pydantic v2, the
.dict()call (line 246) should be replaced with.model_dump()as part of migration—.dict()is deprecated in Pydantic v2. For this simple no-argument usage, the replacement is straightforward:task.model_dump().
Actionable comments posted: 0
♻️ Duplicate comments (1)
ami/jobs/tasks.py (1)
160-160: Use timezone-aware datetime for consistency with Django settings.The code uses
datetime.now()which returns a naive datetime. The inline comment suggests this is intentional ("Use naive datetime in local time"), but Django applications withUSE_TZ=Trueshould use timezone-aware timestamps to avoid timezone-related bugs.Apply this diff:
def _update_job_progress(job_id: int, stage: str, progress_percentage: float) -> None: from ami.jobs.models import Job, JobState # avoid circular import + from django.utils import timezone with transaction.atomic(): job = Job.objects.select_for_update().get(pk=job_id) job.progress.update_stage( stage, status=JobState.SUCCESS if progress_percentage >= 1.0 else JobState.STARTED, progress=progress_percentage, ) if stage == "results" and progress_percentage >= 1.0: job.status = JobState.SUCCESS job.progress.summary.status = JobState.SUCCESS - job.finished_at = datetime.now() # Use naive datetime in local time + job.finished_at = timezone.now() job.logger.info(f"Updated job {job_id} progress in stage '{stage}' to {progress_percentage*100}%") job.save()
🧹 Nitpick comments (4)
ami/jobs/tasks.py (3)
45-45: Consider narrowing the autoretry exception types.The task uses
autoretry_for=(Exception,)which will retry on any exception. This may mask distinct failure modes (e.g., validation errors vs. transient network issues) that should be handled differently. Consider specifying more targeted exceptions or usingretry_forwith explicit exception types.
123-124: Improve exception handling for NATS acknowledgment.The code catches a bare
Exceptionand logs vialogging.errorwithout the traceback. This makes debugging NATS acknowledgment failures difficult.Apply this diff to use
logging.exceptionfor automatic traceback logging:except Exception as ack_error: - job.logger.error(f"Error acknowledging task via NATS: {ack_error}") + job.logger.exception(f"Error acknowledging task via NATS: {ack_error}") # Don't fail the task if ACK fails - data is already savedBased on static analysis hints.
138-144: Use logging.exception for better error diagnostics.The exception handlers at lines 139 and 142 use
logging.errorwhich doesn't include tracebacks. This makes troubleshooting production issues harder.Apply this diff:
except Job.DoesNotExist: - logger.error(f"Job {job_id} not found") + logger.exception(f"Job {job_id} not found") raise except Exception as e: - logger.error(f"Failed to process pipeline result for job {job_id}: {e}") + logger.exception(f"Failed to process pipeline result for job {job_id}: {e}") # Celery will automatically retry based on autoretry_for raiseBased on static analysis hints.
docker-compose.ci.yml (1)
43-55: Consider renaming container for CI consistency.The NATS container is named
ami_local_natsin a CI compose file. This could cause conflicts if both local and CI stacks run simultaneously, and doesn't follow the naming pattern of other CI resources (e.g.,ami_ci_postgres_data).Consider changing to
ami_ci_nats:nats: image: nats:2.10-alpine - container_name: ami_local_nats + container_name: ami_ci_nats hostname: natsAlternatively, remove the
container_namedirective entirely and let docker-compose auto-generate it with the stack name prefix.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .envs/.ci/.django (1 hunk)
- .envs/.production/.django-example (1 hunk)
- ami/jobs/tasks.py (3 hunks)
- ami/jobs/tests.py (3 hunks)
- docker-compose.ci.yml (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- .envs/.ci/.django
🧰 Additional context used
🧬 Code graph analysis (2)
ami/jobs/tasks.py (4)
ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(32-297)acknowledge_task(213-232)ami/ml/orchestration/task_state.py (2)
TaskStateManager(17-125)update_state(54-83)ami/ml/schemas.py (2)
PipelineResultsResponse(199-214)summary(179-196)ami/jobs/models.py (4)
Job(734-1019)logger(1004-1013)retry(873-885)save(954-965)
ami/jobs/tests.py (2)
ami/ml/orchestration/jobs.py (1)
queue_images_to_nats(28-102)ami/main/models.py (1)
SourceImage(1666-1914)
🪛 Ruff (0.14.8)
ami/jobs/tasks.py
123-123: Do not catch blind exception: Exception
(BLE001)
124-124: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
135-135: Abstract raise to an inner function
(TRY301)
139-139: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
142-142: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (7)
ami/jobs/tasks.py (2)
203-225: LGTM! The `log_time` helper provides a clean, functional approach to timing code blocks. The use of `time.perf_counter()` is appropriate for performance measurements, and the docstring clearly explains the usage pattern.
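The helper itself isn't quoted in this thread, so the following is only a guess at the pattern being praised: each call logs the elapsed time and returns a functools.partial primed with the current timestamp for the next measurement.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def log_time(label: str, start: float | None = None):
    """Log time elapsed since `start` (if given) and return (elapsed, timer_for_next_step)."""
    now = time.perf_counter()
    elapsed = 0.0 if start is None else now - start
    if start is not None:
        logger.info("%s took %.3fs", label, elapsed)
    # The returned partial is pre-seeded with `now`, so measurements chain:
    #   _, timer = log_time("start"); ...; _, timer = timer("load images"); ...; timer("save results")
    return elapsed, functools.partial(log_time, start=now)
```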
1-18: LGTM!The imports are well-organized and appropriate for the new functionality. The use of local imports inside functions (e.g.,
from ami.jobs.models import Job) to avoid circular dependencies is a good practice..envs/.production/.django-example (1)
69-71: LGTM!The NATS URL configuration is consistent with the NATS service added to docker-compose and matches the pattern used in local and CI environments.
ami/jobs/tests.py (2)
330-338: LGTM!The test now correctly creates and saves
SourceImageinstances before passing them toqueue_images_to_nats. This ensures each image has a valid primary key, addressing the previous concern about unsaved objects withpk=None.
403-405: LGTM!The updated assertions correctly reflect the new asynchronous result processing workflow where results are queued to Celery tasks rather than processed synchronously.
docker-compose.ci.yml (2)
25-25: LGTM!Adding NATS to Django's dependencies is correct, as the new task queue and job processing functionality requires NATS JetStream to be available.
50-50: JetStream and monitoring flags are correctly configured.The NATS service configuration uses
-jsto enable JetStream and-m 8222to start the HTTP monitoring server on the conventional port. Both flags are supported in NATS 2.10 and this combination is the standard way to enable these features.
Actionable comments posted: 0
🧹 Nitpick comments (9)
docker-compose.ci.yml (1)
19-25: Consider adding explicit health condition todepends_onfor NATS.The django service now depends on nats, but the dependency lacks
condition: service_healthy, unlike the minio-init service (line 84). This means Django may start before NATS is fully initialized and healthy, potentially causing early connection failures.If the application already has robust retry/backoff logic for NATS connections, this is acceptable. Otherwise, consider updating to ensure NATS is healthy:
depends_on: - postgres - redis - minio-init - ml_backend - rabbitmq - - nats + nats: + condition: service_healthyami/ml/tests.py (3)
860-884: Solid coverage; consider relying more on the public API than_get_progressThis suite gives good coverage of initialization, per-stage tracking, empty jobs, and cleanup. The one downside is the heavy reliance on the private
_get_progresshelper, which couples tests to implementation details and may make refactors noisier. Where feasible, prefer going throughupdate_state(including fortest_progress_trackingandtest_stages_independent) and reserve_get_progresscalls for very targeted cases where you intentionally want to test that internal behavior.
895-918: Float equality assertions could use assertAlmostEqual for robustness

test_progress_tracking asserts exact equality on floating-point percentages (0.4, 0.8, 1.0). This works today but is a bit brittle to minor implementation changes (e.g., if the total is computed differently). Using assertAlmostEqual(progress.percentage, 0.4, places=6) (and similarly for the other percentages) would make these tests more resilient while preserving intent.
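As an illustration only — the TaskStateManager constructor and update_state signature below are inferred from this thread and may not match the actual code:

```python
from django.test import TestCase

from ami.ml.orchestration.task_state import TaskStateManager


class TaskStateProgressTest(TestCase):
    def test_progress_tracking_tolerates_float_error(self):
        manager = TaskStateManager(job_id=123)  # assumed constructor
        manager.initialize_job([str(i) for i in range(5)])
        manager.update_state({"0", "1"}, "process", request_id="req-1")  # assumed signature
        progress = manager._get_progress("process")
        # Compare with a tolerance instead of exact float equality.
        self.assertAlmostEqual(progress.percentage, 0.4, places=6)
```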
920-944: Locking behavior is covered; optional extra check for lock lifecycle
test_update_state_with_lockingnicely validates that a held lock causesupdate_stateto no-op and that progress resumes after the lock is cleared. If you want to harden this further, you could also assert thatupdate_statedoes not inadvertently modify the lock when it’s held by another task (e.g., confirm the lock key/value is unchanged after the failed update) to guard against future changes to the lock-release logic.ami/ml/orchestration/test_nats_queue.py (5)
10-38: Good structure and mocking helpers; small clarity tweaks possibleThe use of
IsolatedAsyncioTestCaseplus_create_sample_taskand_create_mock_nats_connectionkeeps the async tests readable and DRY. One minor improvement would be to document in_create_mock_nats_connectionwhich subset of NATS/JetStream APIs you rely on (e.g., in a comment) so future changes toTaskQueueManagerknow what needs to be mocked or extended here.
39-49: Context manager lifecycle test is solid; optionally assert__aenter__/__aexit__effects more strictlyThe lifecycle test correctly verifies that a connection is established and closed. If you want this to catch more regressions, you could also assert that
manager.nc/manager.jsareNoneafter the context exits, and optionally usenc.close.assert_awaited_once()to emphasize the async nature of the close call.
50-64: Publish-path test works but could assert arguments more preciselyRight now the test checks
add_streamindirectly via a substring search oncall_args. To make this more robust, consider asserting on the actual kwargs used (e.g., stream name and subjects) instead of relying onstr(js.add_stream.call_args), which is more brittle to representation changes of mocks.
106-117: ACK behavior test is clear; consider also covering failure/no-connection caseThe ACK test correctly ensures the manager publishes
b"+ACK"to the supplied reply subject. As a follow-up, you might also add a negative test mirroringtest_operations_without_connection_raise_errorto ensureacknowledge_taskalso raises when no connection is open, keeping behavior consistent across APIs.
118-129: Cleanup test validates both consumer and stream deletion
test_cleanup_job_resourcesconfirms that bothdelete_consumeranddelete_streamare invoked and that the method returnsTrue, which is the key behavior here. If you later add error handling (e.g., ignore “not found”), consider extending this test to cover that branch as well.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- ami/ml/orchestration/test_nats_queue.py (1 hunk)
- ami/ml/tests.py (1 hunk)
- docker-compose.ci.yml (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
ami/ml/tests.py (1)
ami/ml/orchestration/task_state.py (4)
TaskStateManager(17-125)initialize_job(39-49)_get_progress(85-117)update_state(54-83)
ami/ml/orchestration/test_nats_queue.py (1)
ami/ml/schemas.py (1)
PipelineProcessingTask(217-228)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (7)
docker-compose.ci.yml (2)
43-52: NATS service configuration is well-suited for CI testing.The NATS service is properly configured with:
- Pinned image version (nats:2.10-alpine) for reproducibility
- Correct hostname (nats) matching the NATS_URL in environment config
- JetStream enabled (-js flag) for the task queue feature
- Monitoring port exposed (-m 8222) with appropriate healthcheck
- Reasonable healthcheck parameters (10s interval, 5s timeout, 3 retries)
73-73: Minio healthcheck formatting improvement.The reformatting from single-quoted string to explicit array syntax improves consistency with other healthchecks in the file (NATS, etc.) and aligns with docker-compose best practices for the healthcheck
testdirective.ami/ml/tests.py (1)
960-980: Empty-job and cleanup behavior look correctThe expectations for empty jobs (immediately 100% complete) and for
cleanupreturningNoneprogress afterwards are well aligned withTaskStateManager’s semantics. No changes needed here.ami/ml/orchestration/test_nats_queue.py (4)
65-89: Reservation success path is well coveredThe
reserve_taskhappy path is nicely exercised: deserialization,reply_subjectpropagation, andunsubscribeare all asserted. This aligns well with the intendedPipelineProcessingTaskcontract; no changes needed here.
90-105: No-message reservation behavior looks correct
test_reserve_task_no_messagesappropriately verifies theNonereturn and that the pull subscription is still cleaned up viaunsubscribe(). That matches the desired contract for idle queues.
130-137: Naming convention tests nicely lock down per-job resource namesThe expectations for stream (
job_123), subject (job.123.tasks), and consumer (job-123-consumer) names are explicit and make future refactors safer. This is a good use of tests against private helpers.
138-150: Error-on-missing-connection behavior is correctly asserted
test_operations_without_connection_raise_errorensures that key operations fail fast with a clearRuntimeErrorwhen the manager isn’t entered. That’s important guardrail behavior for callers and looks correct as-is.
@mihow I'm not addressing two of coderabbit comments:
Actionable comments posted: 1
🧹 Nitpick comments (2)
docker-compose.staging.yml (2)
20-23: Fix indentation for the commented NATS dependency.The commented
# - natshas extra indentation compared to the other list items. While YAML allows comments at any indentation, this is inconsistent and could be confusing.🔎 Proposed fix
depends_on: - postgres - redis - # - nats + # - nats
69-82: Consider JetStream persistence and container naming.Two observations:
No JetStream persistence: Without a volume mount for JetStream storage, streams/consumers and any pending messages will be lost on container restart. This may be acceptable for staging/testing, but consider adding persistence if jobs shouldn't be lost across restarts:
volumes: - nats_data:/data command: ["-js", "-m", "8222", "-sd", "/data"]Container name inconsistency:
ami_local_natsuses "local" prefix but this is the staging compose file. Considerami_staging_natsfor consistency.Please confirm whether JetStream persistence is intentionally omitted for staging, or if it should be added to preserve task queues across container restarts.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- ami/jobs/tasks.py (3 hunks)
- docker-compose.staging.yml (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ami/jobs/tasks.py (5)
ami/ml/orchestration/nats_queue.py (2)
TaskQueueManager(32-297)acknowledge_task(213-232)ami/ml/orchestration/task_state.py (2)
TaskStateManager(17-125)update_state(54-83)ami/ml/schemas.py (2)
PipelineResultsResponse(199-214)summary(179-196)ami/jobs/models.py (5)
Job(734-1019)logger(1004-1013)retry(873-885)JobState(27-63)save(954-965)ami/ml/models/pipeline.py (3)
save_results(814-922)save_results(1120-1121)save(1128-1134)
🪛 Ruff (0.14.8)
ami/jobs/tasks.py
98-98: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
124-124: Abstract raise to an inner function
(TRY301)
127-127: Do not catch blind exception: Exception
(BLE001)
128-128: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
144-144: Do not catch blind exception: Exception
(BLE001)
145-145: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (4)
docker-compose.staging.yml (1)
32-50: LGTM!

Adding `restart: always` to the core services (django, postgres, redis) improves reliability for the staging environment.

ami/jobs/tasks.py (3)
131-147: LGTM - Error handling strategy is appropriate.

The helper correctly wraps the async NATS acknowledgment and handles failures gracefully without failing the parent task. The comment on line 146 clearly explains the design decision: if data is already saved, it's better to log ACK failures and let NATS redeliver than to fail the entire task.
205-227: LGTM - Clean timing utility.

The implementation is straightforward and correct. Using `time.perf_counter()` is appropriate for measuring elapsed time, and the partial function pattern allows convenient chaining of timing measurements throughout the task execution.
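For readers unfamiliar with that chaining pattern, here is a self-contained sketch of the idea (not the helper from this PR; names and logging are illustrative):

```python
import time
from functools import partial


def mark(start: float, label: str, log=print):
    """Log the time elapsed since `start`, then return a new timer anchored at 'now'
    so the next stage can be measured by calling the returned partial."""
    now = time.perf_counter()
    log(f"{label} took {now - start:.3f}s")
    return partial(mark, now)


# Usage: each call times one stage and hands back the timer for the next one.
timer = partial(mark, time.perf_counter())
# ... stage 1 work ...
timer = timer("stage 1")
# ... stage 2 work ...
timer = timer("stage 2")
```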
149-165: Verified: `USE_TZ = False` is configured; no changes needed.

The project's Django settings explicitly set `USE_TZ = False` with a comment noting "All time are in local time." With this configuration, `datetime.datetime.now()` on line 162 is the correct approach and is used consistently throughout the codebase (20+ instances). The explicit comment in the code confirming "Use naive datetime in local time" aligns with the project's timezone strategy.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @docker-compose.staging.yml:
- Around line 77-81: Healthcheck uses wget which isn't present in the
nats:2.10-alpine image causing false unhealthy status; update the healthcheck
"test" in the healthcheck block to use a simple shell TCP probe (via sh and
/dev/tcp) or use the CMD-SHELL form to attempt a TCP connection to
localhost:8222/healthz and return success/failure, ensuring it works with the
minimal sh-only Alpine runtime.
🧹 Nitpick comments (1)
docker-compose.staging.yml (1)
20-23: Consider uncommenting the NATS dependency or adding a note.

The `nats` dependency is commented out, meaning Django and Celery workers could start before NATS is ready. When the `async_pipeline_workers` feature flag is enabled for a project, this could cause transient connection failures during startup. If the application code includes retry logic for NATS connections, this is acceptable; otherwise, consider uncommenting the dependency.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docker-compose.staging.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (2)
docker-compose.staging.yml (2)
32-32: LGTM!

Adding `restart: always` to core services is appropriate for staging environments and ensures resilience against transient failures.

Also applies to: 46-46, 50-50
69-76: NATS service configuration looks good.

The JetStream setup with the `-js` flag, appropriate port mappings, and hostname configuration are correct for the PR's task queue requirements.

Also applies to: 82-82
@mihow review the errors we are getting when calling `TaxaList.get_or_create()` (the TaxaList already exists).
> task queuing with acknowledgment support via reply subjects. This allows workers
> to pull tasks over HTTP and acknowledge them later without maintaining a persistent
> connection to NATS.
@mihow - added this comment on the selection of NATS over Rabbit and Beanstalk.
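For context, the worker-side loop this enables looks roughly like the sketch below. The endpoint paths, query parameter, and response shape are assumptions based on the views in this PR, and `run_pipeline` is a stand-in for the actual ML inference:

```python
import requests

API = "https://antenna.example.org/api/v2"  # hypothetical base URL


def run_pipeline(image_url: str) -> dict:
    """Placeholder for the real ML inference; returns a pipeline results payload."""
    return {"source_images": [], "detections": []}


def work_one_task(job_id: int, session: requests.Session) -> bool:
    # Pull one task over HTTP; the server reserves it from the job's NATS stream.
    resp = session.get(f"{API}/jobs/{job_id}/tasks", params={"batch": 1})
    resp.raise_for_status()
    tasks = resp.json().get("tasks", [])
    if not tasks:
        return False  # queue is idle

    task = tasks[0]
    results = run_pipeline(task["image_url"])

    # Submit the results along with the reply_subject so the server can ACK
    # the corresponding NATS message once processing has been queued.
    session.post(
        f"{API}/jobs/{job_id}/result",
        json={"results": [{"reply_subject": task["reply_subject"], "result": results}]},
    ).raise_for_status()
    return True
```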
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ami/jobs/views.py (1)
226-250: Add maximum batch size constraint to prevent resource exhaustion.

The `batch` parameter has no upper bound, allowing users to request arbitrarily large numbers of tasks. A single request with a high batch value ties up the request thread (via `async_to_sync`), and multiple concurrent requests could exhaust thread resources. Add a configurable max cap.
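A minimal sketch of such a cap (the setting name and default below are assumptions, not part of this PR):

```python
from django.conf import settings

# Hypothetical setting; adjust the name/default to whatever the project adopts.
MAX_TASK_BATCH = getattr(settings, "JOB_TASKS_MAX_BATCH", 50)


def clamp_batch_param(raw_value: str | None, default: int = 1) -> int:
    """Parse the ?batch= query parameter and clamp it to a safe range."""
    try:
        requested = int(raw_value) if raw_value is not None else default
    except (TypeError, ValueError):
        requested = default
    return max(1, min(requested, MAX_TASK_BATCH))
```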
🤖 Fix all issues with AI agents
In `@ami/jobs/views.py`:
- Around line 270-304: The current code validates and enqueues each item as it
iterates over results (using PipelineTaskResult and
process_pipeline_result.delay), which can dispatch partial work if a later item
fails; change the flow to first validate all items without calling
process_pipeline_result.delay (e.g., iterate results and construct/validate
PipelineTaskResult objects into a temporary list), and only after every item
validates successfully iterate that temporary list to call
process_pipeline_result.delay and populate queued_tasks for the response (keep
using job.pk, reply_subject and result_data.dict() when enqueuing).
In `@ami/ml/orchestration/jobs.py`:
- Around line 41-92: When building tasks, track skipped images (e.g., add a
skipped_count variable incremented in the "if not image_url" branch) and treat
them as failures: after queue_all_images runs, add skipped_count to
failed_queues (or if tasks is empty, mark job as FAILED when skipped_count > 0
instead of SUCCESS). Update logic around TaskStateManager initialization and the
"if tasks:" / else branch so totals (successful_queues, failed_queues) include
skipped_count and job.progress/state reflect that skipped images count as
failures; reference the PipelineProcessingTask creation loop, the skipped image
logger call, the queue_all_images async function, and the final
async_to_sync(queue_all_images)() call to locate where to apply these changes.
♻️ Duplicate comments (1)
ami/jobs/tasks.py (1)
102-130: Post-ACK failures can leave progress stale; ACK last or retry.

ACK is sent before results-stage updates; if `update_state` / `_update_job_progress` fails after ACK, NATS won't redeliver and progress may stay behind. Consider moving ACK after results-stage updates (or explicitly retry on post-ACK failures).

🐛 Proposed fix
```diff
-        _ack_task_via_nats(reply_subject, job.logger)
-        # Update job stage with calculated progress
-        progress_info = state_manager.update_state(processed_image_ids, stage="results", request_id=self.request.id)
+        # Update job stage with calculated progress
+        progress_info = state_manager.update_state(processed_image_ids, stage="results", request_id=self.request.id)
@@
-        _update_job_progress(job_id, "results", progress_info.percentage)
+        _update_job_progress(job_id, "results", progress_info.percentage)
+        _ack_task_via_nats(reply_subject, job.logger)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `ami/jobs/tasks.py`
- `ami/jobs/views.py`
- `ami/ml/orchestration/jobs.py`
- `ami/ml/orchestration/nats_queue.py`
🧰 Additional context used
🧬 Code graph analysis (3)
ami/ml/orchestration/nats_queue.py (1)
- ami/ml/schemas.py (1)
  - `PipelineProcessingTask` (217-228)

ami/jobs/views.py (3)
- ami/jobs/tasks.py (1)
  - `process_pipeline_result` (47-130)
- ami/ml/schemas.py (1)
  - `PipelineTaskResult` (231-237)
- ami/ml/orchestration/nats_queue.py (2)
  - `TaskQueueManager` (35-300)
  - `reserve_task` (159-214)

ami/ml/orchestration/jobs.py (4)
- ami/jobs/models.py (4)
  - `Job` (734-1019)
  - `JobState` (27-63)
  - `logger` (1004-1013)
  - `save` (954-965)
- ami/ml/orchestration/nats_queue.py (3)
  - `TaskQueueManager` (35-300)
  - `cleanup_job_resources` (284-300)
  - `publish_task` (126-157)
- ami/ml/orchestration/task_state.py (3)
  - `TaskStateManager` (17-125)
  - `cleanup` (119-125)
  - `initialize_job` (39-49)
- ami/ml/schemas.py (1)
  - `PipelineProcessingTask` (217-228)
🪛 Ruff (0.14.11)
ami/ml/orchestration/nats_queue.py
80-80: Avoid specifying long messages outside the exception class
(TRY003)
88-88: Do not catch blind exception: Exception
(BLE001)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
110-110: Do not catch blind exception: Exception
(BLE001)
138-138: Avoid specifying long messages outside the exception class
(TRY003)
153-153: Consider moving this statement to an else block
(TRY300)
155-155: Do not catch blind exception: Exception
(BLE001)
156-156: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
171-171: Avoid specifying long messages outside the exception class
(TRY003)
212-212: Do not catch blind exception: Exception
(BLE001)
213-213: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
227-227: Avoid specifying long messages outside the exception class
(TRY003)
232-232: Consider moving this statement to an else block
(TRY300)
233-233: Do not catch blind exception: Exception
(BLE001)
234-234: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
248-248: Avoid specifying long messages outside the exception class
(TRY003)
256-256: Consider moving this statement to an else block
(TRY300)
257-257: Do not catch blind exception: Exception
(BLE001)
258-258: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
272-272: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Consider moving this statement to an else block
(TRY300)
280-280: Do not catch blind exception: Exception
(BLE001)
281-281: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/jobs/views.py
253-253: Unused method argument: pk
(ARG002)
306-306: Avoid specifying long messages outside the exception class
(TRY003)
308-308: Do not catch blind exception: Exception
(BLE001)
309-309: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/jobs/tasks.py
98-98: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
124-124: Abstract raise to an inner function
(TRY301)
127-127: Do not catch blind exception: Exception
(BLE001)
128-130: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
146-146: Do not catch blind exception: Exception
(BLE001)
147-147: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
ami/ml/orchestration/jobs.py
75-75: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Redirect rules
- GitHub Check: Header rules
- GitHub Check: Pages changed
- GitHub Check: test
🔇 Additional comments (8)
ami/ml/orchestration/jobs.py (1)
10-25: LGTM: cleanup helper returns success status.

Returning the async cleanup result allows callers to react to failures.
ami/ml/orchestration/nats_queue.py (4)
26-63: LGTM: connection lifecycle is cleaned up.

Context manager setup/teardown is clear and ensures the client is closed.
77-125: LGTM: per-job stream/consumer provisioning is clear.
126-214: LGTM: publish/reserve flow and reply_subject mapping.
216-300: LGTM: ACK and cleanup helpers are straightforward.

ami/jobs/tasks.py (3)
133-148: LGTM: ACK helper encapsulates NATS call cleanly.
151-166: LGTM: atomic progress update under row lock.
207-229: LGTM: lightweight timing helper is clear.
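For context, the "atomic progress update under row lock" praised above typically looks something like the sketch below (method and field names are assumptions based on snippets quoted elsewhere in this review, not the exact code from this PR):

```python
from django.db import transaction

from ami.jobs.models import Job


def update_job_progress(job_id: int, stage: str, progress: float) -> None:
    # Lock the job row so concurrent Celery result tasks don't overwrite
    # each other's progress (the read-modify-write happens under the lock).
    with transaction.atomic():
        job = Job.objects.select_for_update().get(pk=job_id)
        job.progress.update_stage(stage, progress=progress)
        job.save()
```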
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
```diff
+        queued_tasks = []
         try:
-            queued_tasks = []
             # Queue each result for background processing
             for item in results:
                 task_result = PipelineTaskResult(**item)
-                # Stub: Log that we received the result but don't process it yet
-                logger.warning(
-                    f"Result endpoint called for job {job_id} (reply_subject: {task_result.reply_subject}) "
-                    "but result processing not yet available."
+                reply_subject = task_result.reply_subject
+                result_data = task_result.result
+
+                # Queue the background task
+                # Convert Pydantic model to dict for JSON serialization
+                task = process_pipeline_result.delay(
+                    job_id=job.pk, result_data=result_data.dict(), reply_subject=reply_subject
                 )

-                # TODO: Implement result storage and processing
                 queued_tasks.append(
                     {
-                        "reply_subject": task_result.reply_subject,
-                        "status": "pending_implementation",
-                        "message": "Result processing not yet implemented.",
+                        "reply_subject": reply_subject,
+                        "status": "queued",
+                        "task_id": task.id,
                     }
                 )

+                logger.info(
+                    f"Queued pipeline result processing for job {job.pk}, "
+                    f"task_id: {task.id}, reply_subject: {reply_subject}"
+                )
+
         return Response(
             {
                 "status": "accepted",
                 "job_id": job.pk,
+                "results_queued": len([t for t in queued_tasks if t["status"] == "queued"]),
                 "tasks": queued_tasks,
             }
         )
```
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the file structure
fd "views.py" | grep -E "ami/jobs"Repository: RolnickLab/antenna
Length of output: 79
🏁 Script executed:
# Get the full context of lines 270-304 and surrounding code
cat -n ami/jobs/views.py | sed -n '250,320p'

Repository: RolnickLab/antenna
Length of output: 2932
🏁 Script executed:
# Look for PipelineTaskResult definition
rg "class PipelineTaskResult" -A 20Repository: RolnickLab/antenna
Length of output: 948
🏁 Script executed:
# Check the try-except structure and what happens on ValidationError
rg "process_pipeline_result" -B 5 -A 15 ami/jobs/views.pyRepository: RolnickLab/antenna
Length of output: 1704
Validate all results before enqueuing to prevent partial queueing and duplicate task processing.
If validation fails on a later item, earlier items are already dispatched to Celery (line 280) but the error response (line 306) may cause clients to retry with the same data, resulting in duplicate tasks. Pre-validate all items before enqueueing any tasks.
Proposed fix
```diff
         queued_tasks = []
         try:
+            validated_results = [PipelineTaskResult(**item) for item in results]
+            # Queue each result for background processing
-            # Queue each result for background processing
-            for item in results:
-                task_result = PipelineTaskResult(**item)
+            for task_result in validated_results:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        queued_tasks = []
        try:
            validated_results = [PipelineTaskResult(**item) for item in results]
            # Queue each result for background processing
            for task_result in validated_results:
                reply_subject = task_result.reply_subject
                result_data = task_result.result

                # Queue the background task
                # Convert Pydantic model to dict for JSON serialization
                task = process_pipeline_result.delay(
                    job_id=job.pk, result_data=result_data.dict(), reply_subject=reply_subject
                )

                queued_tasks.append(
                    {
                        "reply_subject": reply_subject,
                        "status": "queued",
                        "task_id": task.id,
                    }
                )

                logger.info(
                    f"Queued pipeline result processing for job {job.pk}, "
                    f"task_id: {task.id}, reply_subject: {reply_subject}"
                )

        return Response(
            {
                "status": "accepted",
                "job_id": job.pk,
                "results_queued": len([t for t in queued_tasks if t["status"] == "queued"]),
                "tasks": queued_tasks,
            }
        )
```
🤖 Prompt for AI Agents
In `@ami/jobs/views.py` around lines 270 - 304, The current code validates and
enqueues each item as it iterates over results (using PipelineTaskResult and
process_pipeline_result.delay), which can dispatch partial work if a later item
fails; change the flow to first validate all items without calling
process_pipeline_result.delay (e.g., iterate results and construct/validate
PipelineTaskResult objects into a temporary list), and only after every item
validates successfully iterate that temporary list to call
process_pipeline_result.delay and populate queued_tasks for the response (keep
using job.pk, reply_subject and result_data.dict() when enqueuing).
```python
    # Prepare all messages outside of async context to avoid Django ORM issues
    tasks: list[tuple[int, PipelineProcessingTask]] = []
    image_ids = []
    for image in images:
        image_id = str(image.pk)
        image_url = image.url() if hasattr(image, "url") and image.url() else ""
        if not image_url:
            job.logger.warning(f"Image {image.pk} has no URL, skipping queuing to NATS for job '{job.pk}'")
            continue
        image_ids.append(image_id)
        task = PipelineProcessingTask(
            id=image_id,
            image_id=image_id,
            image_url=image_url,
        )
        tasks.append((image.pk, task))

    # Store all image IDs in Redis for progress tracking
    state_manager = TaskStateManager(job.pk)
    state_manager.initialize_job(image_ids)
    job.logger.info(f"Initialized task state tracking for {len(image_ids)} images")

    async def queue_all_images():
        successful_queues = 0
        failed_queues = 0

        async with TaskQueueManager() as manager:
            for image_pk, task in tasks:
                try:
                    logger.info(f"Queueing image {image_pk} to stream for job '{job.pk}': {task.image_url}")
                    success = await manager.publish_task(
                        job_id=job.pk,
                        data=task,
                    )
                except Exception as e:
                    logger.error(f"Failed to queue image {image_pk} to stream for job '{job.pk}': {e}")
                    success = False

                if success:
                    successful_queues += 1
                else:
                    failed_queues += 1

        return successful_queues, failed_queues

    if tasks:
        successful_queues, failed_queues = async_to_sync(queue_all_images)()
    else:
        job.progress.update_stage("process", status=JobState.SUCCESS, progress=1.0)
        job.progress.update_stage("results", status=JobState.SUCCESS, progress=1.0)
        job.save()
        successful_queues, failed_queues = 0, 0
```
Count skipped images as failures to avoid false SUCCESS.
Images without URLs are skipped and don’t affect the returned success value; if all images are skipped, the job is marked SUCCESS even though nothing was queued. Track skipped images and treat “all skipped” as failure to avoid silent drops.
🐛 Proposed fix
```diff
-    tasks: list[tuple[int, PipelineProcessingTask]] = []
-    image_ids = []
+    tasks: list[tuple[int, PipelineProcessingTask]] = []
+    image_ids = []
+    skipped_images = 0
     for image in images:
         image_id = str(image.pk)
         image_url = image.url() if hasattr(image, "url") and image.url() else ""
         if not image_url:
             job.logger.warning(f"Image {image.pk} has no URL, skipping queuing to NATS for job '{job.pk}'")
+            skipped_images += 1
             continue
         image_ids.append(image_id)
         task = PipelineProcessingTask(
             id=image_id,
             image_id=image_id,
             image_url=image_url,
         )
         tasks.append((image.pk, task))
@@
-    if tasks:
-        successful_queues, failed_queues = async_to_sync(queue_all_images)()
-    else:
+    if tasks:
+        successful_queues, failed_queues = async_to_sync(queue_all_images)()
+        failed_queues += skipped_images
+    else:
+        if images:
+            job.logger.error(
+                f"All {len(images)} images were skipped (missing URLs); nothing queued for job '{job.pk}'"
+            )
+            return False
         job.progress.update_stage("process", status=JobState.SUCCESS, progress=1.0)
         job.progress.update_stage("results", status=JobState.SUCCESS, progress=1.0)
         job.save()
         successful_queues, failed_queues = 0, 0
```
🪛 Ruff (0.14.11)
75-75: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@ami/ml/orchestration/jobs.py` around lines 41 - 92, When building tasks,
track skipped images (e.g., add a skipped_count variable incremented in the "if
not image_url" branch) and treat them as failures: after queue_all_images runs,
add skipped_count to failed_queues (or if tasks is empty, mark job as FAILED
when skipped_count > 0 instead of SUCCESS). Update logic around TaskStateManager
initialization and the "if tasks:" / else branch so totals (successful_queues,
failed_queues) include skipped_count and job.progress/state reflect that skipped
images count as failures; reference the PipelineProcessingTask creation loop,
the skipped image logger call, the queue_all_images async function, and the
final async_to_sync(queue_all_images)() call to locate where to apply these
changes.

Summary
Initial version of the Processing service V2.
Closes #971
Closes #968
Closes #969
Current state
The V2 path is working but disabled by default in this PR to allow for extended testing. When enabled, starting a job creates a queue for that job and populates it with one task per image. The tasks can be pulled and ACKed via the APIs introduced in PR #1046. The new path can be enabled for a project via the `async_pipeline_workers` feature flag.

List of Changes

- `TaskStateManager` and `TaskQueueManager`

TODOs:
Related Issues
See issues #970 and #971.
How to Test the Changes
This path can be enabled by turning on the `job.project.feature_flags.async_pipeline_workers` feature flag (see `ami/jobs/models.py:400`) and running the `ami worker` from RolnickLab/ami-data-companion#94.

Test

Test both modes by tweaking the flag in the Django admin console:
Test both modes by tweaking the flag in the django admin console:

Deployment Notes
Checklist
Summary by CodeRabbit
New Features
Chores
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.