Background container start: async worker for container starts #1051
Conversation
Container starts are now enqueued to a Redis Stream and processed by a dedicated container-worker service with configurable parallelism. The POST /docker endpoint returns immediately with a start_id, and clients poll GET /docker/status for progress. This decouples the web server from long-running container operations and improves reliability under concurrent load.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
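For reference, here is a minimal sketch of the enqueue/poll flow described above, using redis-py. The stream name matches the summary below, but the status-key layout, helper names, and signatures are illustrative assumptions, not the PR's exact code.

```python
import uuid

import redis

CONTAINER_STARTS_STREAM = "container:starts"  # stream the container-worker consumes


def enqueue_start(r: redis.Redis, user_id: int, challenge_id: int) -> str:
    """Record a queued status and push the start request onto the stream."""
    start_id = uuid.uuid4().hex
    r.hset(f"container:start:{start_id}", mapping={
        "status": "queued",
        "user_id": str(user_id),
        "error": "",
    })
    r.xadd(CONTAINER_STARTS_STREAM, {
        "start_id": start_id,
        "user_id": str(user_id),
        "challenge_id": str(challenge_id),
    })
    return start_id


def get_start_status(r: redis.Redis, start_id: str) -> dict | None:
    """Return the status hash the worker updates, or None if the ID is unknown."""
    data = r.hgetall(f"container:start:{start_id}")
    return {k.decode(): v.decode() for k, v in data.items()} if data else None
```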
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e21e5eaaf6
| "challenge_index": str(dojo_challenge.challenge_index), | ||
| "practice": str(int(bool(practice))), | ||
| "as_user_id": str(as_user_id) if as_user_id is not None else "", |
Preserve workspace-token user in queued start
The queued payload only includes as_user_id, but that field is only populated in the admin as_user branch. If a request uses X-Workspace-Token, as_user is set earlier yet as_user_id stays None, so this stream entry sends an empty as_user_id and the worker can only start the challenge as the current user. That changes mounts/flags for token-based starts compared to the synchronous path. Please propagate the token user ID into the payload (or resolve it in the worker) so workspace-token launches target the intended user.
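One possible shape of the fix, sketched with the names visible in the diff. The assumption is that as_user already holds the workspace-token user at this point, so the payload derives the ID from it rather than from the admin-only as_user_id; build_start_payload is a hypothetical helper used only to keep the sketch self-contained.

```python
def build_start_payload(dojo_challenge, practice, as_user=None):
    """Build the stream payload, deriving as_user_id from whichever branch set as_user."""
    # as_user is assumed to be set by both the admin as_user branch and the
    # X-Workspace-Token branch (an assumption about the surrounding code).
    effective_user_id = as_user.id if as_user is not None else None
    return {
        "challenge_index": str(dojo_challenge.challenge_index),
        "practice": str(int(bool(practice))),
        "as_user_id": str(effective_user_id) if effective_user_id is not None else "",
    }
```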
try:
    process_start(r, consumer_name, message_id, message_data)
except Exception as e:
    logger.error(f"Unexpected error processing {message_id}: {e}", exc_info=True)
finally:
    r.xack(CONTAINER_STARTS_STREAM, CONSUMER_GROUP, message_id)
Avoid ACKing failed start jobs
This loop always XACKs in finally even if process_start throws (e.g., transient DB/Redis error before status is updated). In that case the stream entry is dropped, the status key can remain stuck at queued, and the user will poll until timeout with no automatic retry. Consider only ACKing after process_start succeeds (or after you write a terminal failed status), and leaving the message pending on unexpected exceptions so another worker can retry.
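A sketch of the ack-after-success pattern being suggested. CONTAINER_STARTS_STREAM, CONSUMER_GROUP, and process_start are the names from the diff; handle_message is a hypothetical wrapper used only to make the retry behavior explicit.

```python
def handle_message(r, consumer_name, message_id, message_data):
    """Only XACK once process_start finishes; failures stay pending for retry."""
    try:
        process_start(r, consumer_name, message_id, message_data)
    except Exception as e:
        # Leave the entry in the pending list so XAUTOCLAIM or another worker
        # can pick it up again instead of silently dropping the start.
        logger.error(f"Unexpected error processing {message_id=}: {e}", exc_info=True)
    else:
        r.xack(CONTAINER_STARTS_STREAM, CONSUMER_GROUP, message_id)
```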
        return cached
    find_command = ["/bin/find", "/dev", "-type", "c"]
    # When using certain logging drivers (like Splunk), docker.containers.run() returns None
    # Use detach=True and logs() to capture output instead
Let’s not remove existing comments just for the sake of removing comments; instead, let’s avoid adding unnecessary comments in code we write going forward.
def remove_container(user):
    # Just in case our container is still running on the other docker container, let's make sure we try to kill both
See other comment about killing comments
    else:
        raise RuntimeError(f"Workspace failed to become ready.")


def docker_locked(func):
Do we have tests that exercise this “locked” behavior?
Yes — test_concurrent_start_same_user in test/test_container_worker.py exercises this: it starts a challenge, then immediately attempts a second start for the same user and asserts the second is rejected with the lock-contention error. A rough sketch of its shape is below.
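The sketch below is an approximation of that test's structure; the fixture name, the start_challenge helper, and its wait parameter are assumptions about test/utils.py rather than the actual code.

```python
def test_concurrent_start_same_user(random_user_session):
    # Kick off a start but don't wait for it to reach "ready"...
    first = start_challenge(random_user_session, wait=False)
    assert first["success"]

    # ...then immediately try a second start for the same user and expect
    # the per-user docker lock to reject it.
    second = start_challenge(random_user_session, wait=False)
    assert not second["success"]
    assert "lock" in second["error"].lower()
```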
dojo_plugin/api/v1/docker.py (Outdated)
    status = get_start_status(redis_client, start_id)

    if status is None:
        return {"success": False, "error": "Unknown start ID"}
Should this be a 404 HTTP response? Does that make more semantic sense? (Although, is that something we'd need to handle on the frontend?)
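For example (a sketch, not the PR's exact code): Flask/Flask-RESTX view methods accept a (body, status) tuple, so the unknown-ID branch could become:

```python
status = get_start_status(redis_client, start_id)
if status is None:
    # 404 signals "no such start" at the HTTP layer; the frontend's polling
    # loop would need to treat it as a terminal state rather than retrying.
    return {"success": False, "error": "Unknown start ID"}, 404
```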
dojo_plugin/api/v1/docker.py (Outdated)
    user = get_current_user()
    if status.get("user_id") != user.id:
        return {"success": False, "error": "Unknown start ID"}
The error message here is the same as above, but the actual issue is a mismatched user_id?
Intentional — returning the same "Unknown start ID" prevents leaking information about whether a start_id exists for a different user. If we returned a distinct message, an attacker could enumerate valid start_ids.
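In other words, both checks collapse to one uniform response so callers cannot distinguish the two cases (a sketch, combined with the 404 suggestion above):

```python
status = get_start_status(redis_client, start_id)
user = get_current_user()

# Unknown IDs and IDs belonging to another user look identical to the caller,
# so valid start_ids cannot be enumerated across accounts.
if status is None or status.get("user_id") != user.id:
    return {"success": False, "error": "Unknown start ID"}, 404
```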
RETRY_DELAY = 2


def get_redis_client():
This function is duplicated in a lot of places; we should move it to utils or use one that (likely) already exists.
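A minimal sketch of the shared-helper idea. The follow-up commit actually reuses the existing client from background_stats; the env-var-based factory below is only illustrative, and the REDIS_URL default is an assumption.

```python
import os
from functools import lru_cache

import redis


@lru_cache(maxsize=1)
def get_redis_client() -> redis.Redis:
    """Build one process-wide client so workers and the plugin share a single setup."""
    return redis.Redis.from_url(
        os.environ.get("REDIS_URL", "redis://redis:6379"),
        decode_responses=True,
    )
```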
| "error": None, "user_id": user_id, | ||
| }) | ||
|
|
||
| logger.info(f"Starting challenge for user {user_id} start={start_id} (attempt {attempt}/{MAX_ATTEMPTS})") |
Big change: for all the logging in this PR, use the = format of f-strings so that Splunk can easily capture the fields.
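For example, the f-string `=` specifier renders key=value pairs that Splunk's automatic field extraction picks up:

```python
# Before: positional interpolation, which Splunk cannot split into fields.
logger.info(f"Starting challenge for user {user_id} start={start_id} (attempt {attempt}/{MAX_ATTEMPTS})")

# After: the `=` form emits user_id=..., start_id=..., attempt=... automatically
# (string values are rendered with repr quotes, which Splunk still extracts).
logger.info(f"Starting challenge {user_id=} {start_id=} {attempt=} max_attempts={MAX_ATTEMPTS}")
```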
            break

    try:
        if autoclaim_counter >= 12:
Should this be a global?
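The follow-up commit extracts it into AUTOCLAIM_INTERVAL_TICKS; roughly the shape below, where the 60-second idle threshold and helper name are illustrative assumptions and the stream/group/consumer names come from the diff.

```python
AUTOCLAIM_INTERVAL_TICKS = 12  # run XAUTOCLAIM roughly once every 12 poll-loop ticks


def maybe_autoclaim(r, consumer_name, tick: int) -> int:
    """Every AUTOCLAIM_INTERVAL_TICKS iterations, claim entries stuck pending too long."""
    if tick < AUTOCLAIM_INTERVAL_TICKS:
        return tick + 1
    # Claim entries another consumer left pending for over 60s (threshold is illustrative).
    r.xautoclaim(CONTAINER_STARTS_STREAM, CONSUMER_GROUP, consumer_name, min_idle_time=60_000)
    return 0
```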
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
Now that several components set up logging this way, let’s DRY it up so the setup lives in one place.
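Something along these lines — the helper name matches the follow-up commit's setup_worker_logging(), but its module location and exact behavior are assumptions:

```python
import logging

WORKER_LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def setup_worker_logging(name: str, level: int = logging.INFO) -> logging.Logger:
    """Shared stream-handler setup so every worker logs in the same format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(WORKER_LOG_FORMAT))
        logger.addHandler(handler)
    return logger
```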
SPEC.md (Outdated)
@@ -0,0 +1,228 @@
# Background Container Start
Don’t commit the spec; just commit the code.
…C.md

- Restore removed comments on remove_container and get_available_devices
- Fix workspace-token as_user_id not propagating to worker payload
- Move XACK after process_start (don't ACK failed jobs)
- Return 404 for unknown/unauthorized start IDs
- DRY get_redis_client: import from background_stats
- Use = format f-strings in all new logging for Splunk
- Extract AUTOCLAIM_INTERVAL_TICKS constant
- DRY worker logging setup into setup_worker_logging()
- Handle 404 in frontend status polling
- Remove SPEC.md from repo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
- Container starts are enqueued to a Redis Stream (container:starts) and processed by a dedicated container-worker service with configurable parallelism (CONTAINER_WORKERS, default 8)
- POST /docker returns immediately with a start_id; clients poll GET /docker/status?id= for progress (queued → starting → ready/failed)

Test plan
- container-worker service starts and stays running (dojo compose ps)
- ./deploy.sh -t passes existing tests with the updated async start_challenge() in test/utils.py
- test/test_container_worker.py (async flow, status progression, concurrent lock, unknown ID)

🤖 Generated with Claude Code