Background container start: async worker for container starts #1051

Open

adamdoupe wants to merge 2 commits into master from dojo-background-container-start

Conversation

@adamdoupe
Contributor

Summary

  • Container starts are now enqueued to a Redis Stream (container:starts) and processed by a dedicated container-worker service with configurable parallelism (CONTAINER_WORKERS, default 8)
  • POST /docker returns immediately with a start_id; clients poll GET /docker/status?id= for progress (queued → starting → ready/failed)
  • Decouples web server from long-running container operations, improving responsiveness and reliability under concurrent load
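
A minimal sketch of the producer side described above, assuming redis-py; the stream name and the status states come from this PR, while the status-key layout, field names, and TTL are illustrative:

```python
import uuid

import redis

CONTAINER_STARTS_STREAM = "container:starts"  # stream name from this PR


def enqueue_start(r: redis.Redis, user_id: int, challenge_id: int) -> str:
    """Enqueue a container start and return the start_id the client polls."""
    start_id = str(uuid.uuid4())
    # Status hash read by GET /docker/status:
    # queued -> starting -> ready/failed (states from the PR summary)
    r.hset(f"container:start:{start_id}", mapping={
        "status": "queued", "user_id": str(user_id), "error": "",
    })
    r.expire(f"container:start:{start_id}", 3600)  # avoid leaking stale keys
    r.xadd(CONTAINER_STARTS_STREAM, {
        "start_id": start_id,
        "user_id": str(user_id),
        "challenge_id": str(challenge_id),
    })
    return start_id
```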

Test plan

  • Verify container-worker service starts and stays running (dojo compose ps)
  • Start a challenge via the web UI — confirm polling shows status progression and workspace loads on success
  • Start a challenge via test helper — ./deploy.sh -t passes existing tests with updated async start_challenge() in test/utils.py
  • Run new tests in test/test_container_worker.py (async flow, status progression, concurrent lock, unknown ID)
  • Verify container start retries work by checking ctfd/container-worker logs for retry attempts on transient failures
  • Confirm lock prevents double-starts: clicking Start twice quickly returns "please wait" on the second attempt
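
For reference, a hypothetical polling helper in the spirit of the updated start_challenge() in test/utils.py; the /docker endpoints are taken from the summary above, and the field names, paths, and timeout are assumptions:

```python
import time

import requests


def start_and_wait(session: requests.Session, base_url: str, body: dict, timeout: float = 60) -> dict:
    """POST /docker, then poll GET /docker/status until ready/failed."""
    start_id = session.post(f"{base_url}/docker", json=body).json()["start_id"]
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = session.get(f"{base_url}/docker/status", params={"id": start_id}).json()
        if status["status"] in ("ready", "failed"):
            return status
        time.sleep(1)
    raise TimeoutError(f"container start never reached a terminal state: {start_id=}")
```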

🤖 Generated with Claude Code

Container starts are now enqueued to a Redis Stream and processed by a
dedicated container-worker service with configurable parallelism. The
POST /docker endpoint returns immediately with a start_id, and clients
poll GET /docker/status for progress. This decouples the web server
from long-running container operations and improves reliability under
concurrent load.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e21e5eaaf6


Comment on lines +541 to +543
"challenge_index": str(dojo_challenge.challenge_index),
"practice": str(int(bool(practice))),
"as_user_id": str(as_user_id) if as_user_id is not None else "",


P2: Preserve workspace-token user in queued start

The queued payload only includes as_user_id, but that field is only populated in the admin as_user branch. If a request uses X-Workspace-Token, as_user is set earlier yet as_user_id stays None, so this stream entry sends an empty as_user_id and the worker can only start the challenge as the current user. That changes mounts/flags for token-based starts compared to the synchronous path. Please propagate the token user ID into the payload (or resolve it in the worker) so workspace-token launches target the intended user.
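
One way to implement the suggested fix, as a sketch; the variable names mirror the snippet above, and the token-user resolution is an assumption about the surrounding handler:

```python
# Resolve the target user from either the admin as_user branch or the
# X-Workspace-Token user, so the worker payload is correct in both paths.
target_user_id = as_user_id
if target_user_id is None and as_user is not None:
    target_user_id = as_user.id  # user resolved earlier from the workspace token

payload = {
    "challenge_index": str(dojo_challenge.challenge_index),
    "practice": str(int(bool(practice))),
    "as_user_id": str(target_user_id) if target_user_id is not None else "",
}
```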


Contributor Author


Fixed.

Comment on lines 182 to 187
```python
try:
    process_start(r, consumer_name, message_id, message_data)
except Exception as e:
    logger.error(f"Unexpected error processing {message_id}: {e}", exc_info=True)
finally:
    r.xack(CONTAINER_STARTS_STREAM, CONSUMER_GROUP, message_id)
```


P2: Avoid ACKing failed start jobs

This loop always XACKs in finally even if process_start throws (e.g., transient DB/Redis error before status is updated). In that case the stream entry is dropped, the status key can remain stuck at queued, and the user will poll until timeout with no automatic retry. Consider only ACKing after process_start succeeds (or after you write a terminal failed status), and leaving the message pending on unexpected exceptions so another worker can retry.
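
A sketch of that suggestion, matching the follow-up commit's "Move XACK after process_start": acknowledge only on success (or after writing a terminal failed status), so a crashed job stays pending for another consumer to reclaim:

```python
try:
    process_start(r, consumer_name, message_id, message_data)
except Exception as e:
    # No ACK here: the entry stays in the pending list, so another
    # worker can reclaim it via XAUTOCLAIM and retry.
    logger.error(f"Unexpected error processing {message_id=}: {e}", exc_info=True)
else:
    # Acknowledge only once the start was fully processed.
    r.xack(CONTAINER_STARTS_STREAM, CONSUMER_GROUP, message_id)
```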


Contributor Author


Fixed.

```python
return cached
find_command = ["/bin/find", "/dev", "-type", "c"]
# When using certain logging drivers (like Splunk), docker.containers.run() returns None
# Use detach=True and logs() to capture output instead
```
Contributor Author


Let’s not remove comments just to remove comments; let’s just not add unnecessary comments in future written code.

Contributor Author


Fixed.



```python
def remove_container(user):
    # Just in case our container is still running on the other docker container, let's make sure we try to kill both
```
Contributor Author


See other comment about killing comments

Contributor Author


Fixed.

```python
else:
    raise RuntimeError(f"Workspace failed to become ready.")

def docker_locked(func):
```
Contributor Author


Do we have tests that exercise this “locked” behavior?

Contributor Author


Yes — test_concurrent_start_same_user in test/test_container_worker.py exercises this. It starts a challenge, then immediately tries a second start for the same user and asserts it gets rejected with the lock error.
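
A hypothetical shape for that test, based on the description above; the helper names are illustrative, not the actual test code:

```python
def test_concurrent_start_same_user(random_user):
    """A second start for the same user should hit the docker lock."""
    name, session = random_user
    first = begin_start(session)   # enqueue a start, returns a start_id
    second = begin_start(session)  # immediately attempt another
    assert not second["success"]
    assert "wait" in second["error"].lower()       # the "please wait" lock error
    wait_until_ready(session, first["start_id"])   # the first start still completes
```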


```python
status = get_start_status(redis_client, start_id)

if status is None:
    return {"success": False, "error": "Unknown start ID"}
```
Contributor Author


Should this be a 404 HTTP response? Does that make more semantic sense? (Although, is that something we’d need to handle on the frontend?)

Contributor Author


Fixed.


```python
user = get_current_user()
if status.get("user_id") != user.id:
    return {"success": False, "error": "Unknown start ID"}
```
Contributor Author


The error message here is the same as above, but the issue is unknown user_id?

Contributor Author


Intentional — returning the same "Unknown start ID" prevents leaking information about whether a start_id exists for a different user. If we returned a distinct message, an attacker could enumerate valid start_ids.
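
Combined with the follow-up commit's "Return 404 for unknown/unauthorized start IDs", both checks can collapse into a single indistinguishable response; a sketch:

```python
status = get_start_status(redis_client, start_id)
user = get_current_user()
# Unknown IDs and IDs owned by other users get an identical response,
# so valid start_ids cannot be enumerated across users.
if status is None or status.get("user_id") != user.id:
    return {"success": False, "error": "Unknown start ID"}, 404
```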

```python
RETRY_DELAY = 2


def get_redis_client():
```
Contributor Author


There’s a lot of duplication of this function; we should move it to utils or use one that (likely) already exists.

Contributor Author


Fixed.
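
Per the follow-up commit ("DRY get_redis_client: import from background_stats"), the worker now reuses the existing helper; the exact import path below is an assumption:

```python
# Reuse the existing helper instead of redefining it in the worker.
from background_stats import get_redis_client  # hypothetical module path

r = get_redis_client()
```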

"error": None, "user_id": user_id,
})

logger.info(f"Starting challenge for user {user_id} start={start_id} (attempt {attempt}/{MAX_ATTEMPTS})")
Contributor Author


Big change: for all the logging in this PR, use the = format of f-strings so that Splunk can easily capture the fields.

Contributor Author


Fixed.
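
For reference, the f-string "=" specifier (Python 3.8 and later) emits name=value pairs that Splunk's field extraction picks up; applied to the log line above:

```python
# f"{expr=}" renders both the expression and its repr'd value.
logger.info(f"Starting challenge {user_id=} {start_id=} attempt={attempt}/{MAX_ATTEMPTS}")
# e.g. -> Starting challenge user_id=42 start_id='1f3a...' attempt=1/3
```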

```python
break

try:
    if autoclaim_counter >= 12:
```
Contributor Author


Should this be a global?

Contributor Author


Fixed.
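
The follow-up commit names the extracted constant AUTOCLAIM_INTERVAL_TICKS; something along these lines, with an illustrative helper:

```python
# Number of idle polling ticks between XAUTOCLAIM sweeps (was a bare 12).
AUTOCLAIM_INTERVAL_TICKS = 12


def should_autoclaim(autoclaim_counter: int) -> bool:
    """True when it is time to reclaim stale pending stream entries."""
    return autoclaim_counter >= AUTOCLAIM_INTERVAL_TICKS
```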

```python
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
```
Contributor Author


Now that there are a lot of these components that do logging this way, let’s DRY it so that it’s in one place.

Contributor Author


Fixed.
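
The follow-up commit names the shared helper setup_worker_logging(); a sketch with an assumed signature, folding in the setup quoted above:

```python
import logging


def setup_worker_logging(name: str) -> logging.Logger:
    """Shared logging setup for all worker components."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger
```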

SPEC.md (Outdated)

```
@@ -0,0 +1,228 @@
# Background Container Start
```
Contributor Author


Don’t commit the spec; just commit the code.

Contributor Author


Fixed.

…C.md

- Restore removed comments on remove_container and get_available_devices
- Fix workspace-token as_user_id not propagating to worker payload
- Move XACK after process_start (don't ACK failed jobs)
- Return 404 for unknown/unauthorized start IDs
- DRY get_redis_client: import from background_stats
- Use = format f-strings in all new logging for Splunk
- Extract AUTOCLAIM_INTERVAL_TICKS constant
- DRY worker logging setup into setup_worker_logging()
- Handle 404 in frontend status polling
- Remove SPEC.md from repo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
