✨ feat: Add Lambda MicroVM source bundles #36
| Filename | Overview |
|---|---|
| lambda-microvm/lifecycle/src/paperd.rs | Implements paperd decision + launch; privilege drop to uid/gid 1000 is correct, but the spawned Child handle is immediately dropped (zombie risk for PID 1) and the socket-removal path hardcodes AGENT_HOME instead of deriving from the action's env |
| lambda-microvm/lifecycle/src/server.rs | HTTP routing as a pure function with injected Effects; well-structured with clean lock discipline, correct hook-path filtering, and comprehensive unit tests covering all endpoints |
| lambda-microvm/lifecycle/src/state.rs | Shared state with snapshot serialisation; hooks Vec grows unbounded (one entry per hook event, never capped or pruned) |
| lambda-microvm/lifecycle/src/command.rs | Runs STEREOS_RUN_COMMAND via /bin/bash -lc with pipe-draining threads; thread handles are detached (not joined) on timeout, which is safe since pipes close after kill+wait |
| lambda-microvm/lifecycle/src/dispatch.rs | Faithfully ports Python parse_run_dispatch; well-tested with all edge cases covered including falsy runHookPayload, inner parse failure, non-object inner values |
| lambda-microvm/lifecycle/src/main.rs | Entry point; thread-per-request model mirrors Python ThreadingHTTPServer, HOOK_PORT/PORT fallback is correct, timeout env parsing silently falls back on invalid input |
| lib/lambda-microvm.nix | Assembles rootfs closure, generates Dockerfile with hardcoded Nix store SSL_CERT_FILE paths, stamps home/agent 1000:1000 via two-pass tar append; PermitRootLogin directive may appear twice in sshd_config (harmless, first match wins in OpenSSH) |
| lib/paper-bin.nix | Fetches pinned paper binary for aarch64-linux and x86_64-linux with SRI hashes; straightforward fetchurl derivation |
| flake/images.nix | Wires up lambdaMicrovmPkgs per mixtape and merges into the packages attrset; correctly passes agentPackages from mixtape NixOS config plus paper-bin |
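The overview notes that timeout env parsing in main.rs silently falls back on invalid input. Below is a minimal sketch of one way to keep the fallback while surfacing it to the caller. The function name `parse_timeout_secs` and its return shape are hypothetical and are not taken from the PR.

```rust
/// Parse a timeout (in seconds) from an optional env value.
/// Returns the effective timeout plus an optional warning, so the caller
/// can log the fallback instead of applying it silently.
/// NOTE: the name and signature are illustrative assumptions.
fn parse_timeout_secs(raw: Option<&str>, default_secs: u64) -> (u64, Option<String>) {
    match raw {
        // Variable not set: the default is expected, nothing to report.
        None => (default_secs, None),
        Some(text) => match text.trim().parse::<u64>() {
            Ok(secs) => (secs, None),
            // Variable set but unparsable: fall back and say so.
            Err(err) => (
                default_secs,
                Some(format!(
                    "invalid timeout {text:?} ({err}); using {default_secs}s"
                )),
            ),
        },
    }
}
```

A caller could pass `std::env::var("...").ok().as_deref()` for whichever variable holds the timeout and log the warning when it is present.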
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant AWS as AWS Lambda MicroVM
participant LC as lifecycle (PID 1)
participant State as Mutex<State>
participant CMD as STEREOS_RUN_COMMAND (bg thread)
participant paperd as paperd (uid 1000)
Note over LC: startup
LC->>paperd: spawn (uid/gid 1000, own process group)
LC->>State: "set paperd=started"
LC->>LC: bind 0.0.0.0:9000
AWS->>LC: GET /aws/.../v1/ready
LC->>State: record_hook(ready)
LC-->>AWS: 200 snapshot
AWS->>LC: GET /aws/.../v1/validate
LC->>State: record_hook(validate)
LC-->>AWS: 200 snapshot
AWS->>LC: "POST /aws/.../v1/run {runHookPayload}"
LC->>State: run_count++, last_run_dispatch
LC->>CMD: spawn_dispatch(parsed session)
LC-->>AWS: 200 status accepted
CMD->>State: "last_run_command = result"
AWS->>LC: POST /aws/.../v1/suspend
LC->>State: "suspended=true"
LC-->>AWS: 200 snapshot
AWS->>LC: POST /aws/.../v1/resume
LC->>State: "suspended=false"
LC-->>AWS: 200 snapshot
AWS->>LC: POST /aws/.../v1/terminate
LC-->>AWS: 200 status ok
LC->>LC: sleep 250ms, process::exit(0)
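The state transitions in the diagram can be sketched as a pure function over a small state struct. This is an illustration only: `State`, `After`, and `apply_hook` are hypothetical names, the 404 for unknown routes is an assumption, and only the `/v1/<hook>` suffix is matched because the diagram elides the path prefix.

```rust
/// Illustrative subset of the shared state; field names follow the diagram.
#[derive(Debug, Default)]
struct State {
    hooks: Vec<String>,
    run_count: u64,
    suspended: bool,
}

/// What the caller should do once the response has been sent.
#[derive(Debug, PartialEq)]
enum After {
    Nothing,
    SpawnRunCommand,
    Exit,
}

/// Apply one hook request to the state and return (HTTP status, follow-up).
fn apply_hook(state: &mut State, method: &str, path: &str) -> (u16, After) {
    // Take whatever follows the last "/v1/" as the hook name.
    let hook = match path.rsplit_once("/v1/") {
        Some((_, hook)) => hook,
        None => return (404, After::Nothing),
    };
    let after = match (method, hook) {
        ("GET", "ready") | ("GET", "validate") => After::Nothing,
        ("POST", "run") => {
            state.run_count += 1;
            After::SpawnRunCommand
        }
        ("POST", "suspend") => {
            state.suspended = true;
            After::Nothing
        }
        ("POST", "resume") => {
            state.suspended = false;
            After::Nothing
        }
        ("POST", "terminate") => After::Exit,
        _ => return (404, After::Nothing),
    };
    // Every recognised hook is recorded, mirroring record_hook in the diagram.
    state.hooks.push(hook.to_string());
    (200, after)
}
```

Returning the follow-up action instead of performing it keeps the function free of side effects, which is the property that makes routing easy to unit test.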
Comments Outside Diff (1)
lambda-microvm/lifecycle/src/state.rs, line 1704
**Unbounded `hooks` Vec grows for the lifetime of the process**
Every `record_hook` call appends to `hooks` and nothing ever prunes it. During image creation AWS delivers at least `ready` and `validate`; at runtime each `run`/`suspend`/`resume`/`terminate` invocation adds another entry. For a MicroVM that is recycled without a restart this will grow without bound and the full list is serialized into every response snapshot. A simple ring-buffer approach (e.g. keeping the last N entries) or a `hooks.truncate` after some cap would prevent unbounded accumulation.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
lambda-microvm/lifecycle/src/paperd.rs:103-107
**Dropped `Child` handle creates zombie when lifecycle is PID 1**
The `_` wildcard immediately drops the `Child` struct. Rust's `Child` has no `Drop` implementation that calls `wait()`, so dropping the handle just detaches it. Because the lifecycle server is the container's ENTRYPOINT (PID 1) and is paperd's direct parent, no other process will reap the child: an exited paperd stays in the process table as a zombie until the container itself exits. Spawning a reaper thread keeps the existing "fire and forget" semantics while ensuring the process table entry is cleaned up.
```suggestion
match cmd.spawn() {
    Ok(mut child) => {
        tracing::info!(binary = %binary.display(), "started paperd");
        // Reap the child when it exits so it doesn't linger as a
        // zombie; important when the lifecycle process is PID 1.
        std::thread::spawn(move || {
            let _ = child.wait();
            tracing::warn!("paperd exited");
        });
        "started".to_string()
    }
```
### Issue 2 of 3
lambda-microvm/lifecycle/src/paperd.rs:82-85
**`socket_path` uses the `AGENT_HOME` constant, not the `home` from `decide()`**
`decide()` accepts an arbitrary `home` parameter and encodes it into the returned `env`. `start()` correctly passes that `env` to the spawned command, but the stale-socket removal always uses the `AGENT_HOME` constant. If `decide()` is ever called with a different home (e.g. in a test or a future call site), `start()` would attempt to remove the socket from the wrong path, leaving a stale socket that causes paperd's `bind()` to fail with `EADDRINUSE`. Extracting `HOME` from `env` keeps the two halves consistent.
```suggestion
PaperdAction::Start { binary, env } => {
    // A stale socket from a prior boot would block bind. This is a file
    // removal, not a permission change, so it is fine at runtime.
    let home = env
        .iter()
        .find(|(k, _)| k == "HOME")
        .map(|(_, v)| v.as_str())
        .unwrap_or(AGENT_HOME);
    let sock = socket_path(home);
```
### Issue 3 of 3
lambda-microvm/lifecycle/src/state.rs:1704
**Unbounded `hooks` Vec grows for the lifetime of the process**
Every `record_hook` call appends to `hooks` and nothing ever prunes it. During image creation AWS delivers at least `ready` and `validate`; at runtime each `run`/`suspend`/`resume`/`terminate` invocation adds another entry. For a MicroVM that is recycled without a restart this will grow without bound and the full list is serialized into every response snapshot. A simple ring-buffer approach (e.g. keeping the last N entries) or a `hooks.truncate` after some cap would prevent unbounded accumulation.
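A minimal sketch of the ring-buffer idea for Issue 3, assuming a hypothetical cap `MAX_HOOKS` and a stand-in `HookLog` struct. The real field types in state.rs are not shown in the review, so hooks are modelled as plain strings here.

```rust
use std::collections::VecDeque;

/// Hypothetical cap; the review only says "the last N entries".
const MAX_HOOKS: usize = 64;

/// Illustrative stand-in for the hook list held in the shared state.
#[derive(Default)]
struct HookLog {
    hooks: VecDeque<String>,
}

impl HookLog {
    /// Append a hook event, evicting the oldest entries once the cap is reached.
    fn record_hook(&mut self, name: &str) {
        while self.hooks.len() >= MAX_HOOKS {
            self.hooks.pop_front();
        }
        self.hooks.push_back(name.to_string());
    }
}
```

One design point: `Vec::truncate` keeps the first `len` elements, so capping a plain `Vec` with it would discard the newest hooks. A `VecDeque` with `pop_front` keeps the most recent ones instead, which is usually what a snapshot reader wants.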
Reviews (1): Last reviewed commit: "✨ feat: Add Lambda MicroVM source bundle..."
lambda-microvm/lifecycle/src/paperd.rs, lines 103-107 (original code for Issue 1)
    match cmd.spawn() {
        Ok(_) => {
            tracing::info!(binary = %binary.display(), "started paperd");
            "started".to_string()
        }
lambda-microvm/lifecycle/src/paperd.rs, lines 82-85 (original code for Issue 2)
    PaperdAction::Start { binary, env } => {
        // A stale socket from a prior boot would block bind. This is a file
        // removal, not a permission change, so it is fine at runtime.
        let sock = socket_path(AGENT_HOME);
Add a Dockerfile-based source package for AWS Lambda MicroVM image creation, built by Nix and kept separate from stereOS VM mixtape artifacts (no stereos.img / qcow2 / kernel). `flake/images.nix` exposes `packages.<system>.<mixtape>-lambda-microvm-source` for each mixtape.

The image entrypoint is a small Rust HTTP hook server (`lambda-microvm/lifecycle`), installed as `/bin/lambda-microvm-lifecycle`. It serves the AWS image-build hooks (`ready`, `validate`) and runtime hooks (`run`, `suspend`, `resume`, `terminate`), plus direct debug endpoints, parses the AWS `runHookPayload` envelope down to its `session` object, and optionally launches `paperd` as uid/gid 1000.

The server uses only a blocking HTTP listener (tiny_http), with no async runtime, and drops privilege via the kernel at exec, matching the platform's Rust direction and keeping the audit surface small. Behaviour is locked down by unit tests over dispatch parsing, hook routing, command execution, and the paperd start decision.

The rootfs is assembled from a Nix closure. The agent home and its XDG tree are created and owned 1000:1000 at build time, so the image is ready to run with no runtime mkdir/chown. OpenSSH paths are included for the `microvmssh` shell-ingress pattern. Bundles ship the pinned public Paper release binary and set `STEREOS_START_PAPERD=1`.
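The privilege drop at exec described in this commit message can be sketched with the standard library's Unix `CommandExt`. This is a hedged illustration, not the PR's code: `agent_command` and the constants are hypothetical names, and only `HOME` is set, whereas the real launcher also sets the `XDG_*` variables.

```rust
use std::os::unix::process::CommandExt;
use std::path::Path;
use std::process::Command;

/// uid/gid of the agent user, as stated in the commit message.
const AGENT_UID: u32 = 1000;
const AGENT_GID: u32 = 1000;

/// Build (but do not spawn) a command that will run as the agent user.
/// The uid/gid switch is applied in the child process before exec,
/// so no `unsafe` code is needed in the caller.
fn agent_command(binary: &Path, home: &str) -> Command {
    let mut cmd = Command::new(binary);
    cmd.uid(AGENT_UID)
        .gid(AGENT_GID)
        // 0 puts the child in a new process group that it leads.
        .process_group(0)
        .env("HOME", home);
    cmd
}
```

Calling `spawn()` on the result only succeeds when the parent has permission to switch to that uid/gid, which holds when the lifecycle runs as root (PID 1) inside the image.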
The Paper release binary is a glibc-dynamic ELF whose interpreter is the FHS /lib/ld-linux-*.so path, which does not exist in the Lambda MicroVM scratch rootfs — only the Nix store glibc does. Without patching, paper and paperd cannot execute there, so STEREOS_START_PAPERD silently fails and paperd never comes up despite being installed. Run autoPatchelfHook so the interpreter and RPATH point at the Nix glibc (the same loader the lifecycle binary uses), making paper runnable inside the bundle rootfs.
c6f8eed to 1143610
The first interactive `claude` on a fresh MicroVM spends about a minute installing its ~240MB native build before it is usable. Do that work once, during the AWS `ready` build hook, so it lands in the image snapshot and every launched MicroVM starts warm. Measured cold first `claude --version`: ~58s -> ~0.5s.

Adds a generic STEREOS_READY_COMMAND that the lifecycle runs once on the first `ready` hook, before responding 200 (the snapshot is taken after ready returns), as the agent user with an agent-rooted login env. Mixtapes built with `warmAgent = true` ship /usr/local/bin/stereos-warm-agent, which runs `claude install` (no auth, no paper) and pre-seeds onboarding so the first interactive run skips the theme prompt.

The Firecracker snapshot is shared across every MicroVM from the image, so the script strips the per-machine ids that `claude install` writes (machineID, userID), letting them regenerate per VM. Warm-up failures (e.g. no build-time network) are non-fatal: the image just ships unwarmed.
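The "run once on the first `ready` hook, failures non-fatal" behaviour could be sketched as follows. `ReadyOnce` and `run_once` are hypothetical names, and the warm-up is passed in as a closure so the sketch stays independent of how the lifecycle actually executes STEREOS_READY_COMMAND.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Tracks whether the ready-time warm-up has already been attempted.
struct ReadyOnce {
    done: AtomicBool,
}

impl ReadyOnce {
    const fn new() -> Self {
        Self { done: AtomicBool::new(false) }
    }

    /// Run `warm_up` on the first call only.
    /// Returns None when skipped, Some(true) on success, Some(false) on failure.
    /// A failure is logged but never propagated, so `ready` can still answer 200.
    fn run_once<F>(&self, warm_up: F) -> Option<bool>
    where
        F: FnOnce() -> Result<(), String>,
    {
        // swap returns the previous value: only the first caller sees `false`.
        if self.done.swap(true, Ordering::SeqCst) {
            return None;
        }
        match warm_up() {
            Ok(()) => Some(true),
            Err(err) => {
                eprintln!("ready command failed (image ships unwarmed): {err}");
                Some(false)
            }
        }
    }
}
```

With this flag-based approach a concurrent second `ready` request returns immediately rather than waiting for the first warm-up to finish. If later callers must block until the warm-up completes, `std::sync::Once` is the alternative to consider.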
da63c36 to 4d4fe38
🧍 I've been experimenting with running part of stereOS in AWS Lambda MicroVMs, and this work provides the wiring that makes it possible. I built the original lifecycle hook in Python but decided to try it in Rust, which is more familiar to me. 🧍
🤖
Summary
- A Dockerfile-based source package (… `scratch` rootfs tar) per mixtape, exposed as `packages.<system>.<mixtape>-lambda-microvm-source`. Deliberately separate from VM mixtape artifacts (no img/qcow2/kernel).
- A small Rust HTTP hook server (`lambda-microvm/lifecycle`) implementing the AWS Lambda MicroVM hook contract, optionally starting `paperd`.
- Agent home and its XDG tree created and owned `1000:1000` at build time, so there is no runtime `mkdir`/`chown`.
- OpenSSH paths included for the `microvmssh` shell-ingress pattern.
Why Rust (not Python or Go)
The lifecycle began as a Python POC to discover the hook contract. For the upstream shape it is a Rust binary: it supervises `paperd` (itself Rust), matches the platform's Rust investment, and drops the ~150–200 MB Python runtime from the `scratch` rootfs. It uses a blocking HTTP listener (tiny_http) with no async runtime: the work is subprocess supervision, so tokio/axum would only enlarge the audit surface of a privilege-dropping entrypoint. Privilege drop to uid/gid 1000 happens in the kernel at `exec` (`CommandExt::{uid,gid}`), with no `unsafe`.
Hook contract
- `GET /`, `/health`, `/ready`, `/validate`
- `…/v1/ready`, `…/v1/validate` (GET)
- `…/v1/run` (POST): `runHookPayload` → `session`, record `last_run_dispatch`, ack, run `STEREOS_RUN_COMMAND` in background
- `…/v1/{suspend,resume,terminate}`
- `POST /run`, `/suspend`, `/resume`, `/terminate`
`paperd` starts only when `STEREOS_START_PAPERD` is truthy, as uid/gid 1000 with `HOME`/`XDG_*` under `/home/agent`.
Test plan
Automated (green locally):
- `cargo test`: 34 unit tests (dispatch parsing, hook routing, command exec, paperd decision, snapshot key-set drift guard)
- `cargo clippy --all-targets -- -D warnings`, `cargo fmt --check`
- `nix build .#packages.aarch64-linux.{base,coder}-lambda-microvm-source` (on an aarch64 Linux builder)
Manual, on real AWS Lambda MicroVMs (us-west-2, POC account):
- `ready`/`validate` hooks against the Rust binary → image `CREATED`
- `run` hook delivered at launch; envelope double-decode verified live (`last_run_dispatch={"mode":…}`)
- `uid=1000(agent)`; full HTTP surface answers on `localhost:9000`
- `/home/agent/**` ships owned `1000:1000`, with no runtime permission changes
- `paperd` running as `agent` (pid in `paper status`), socket in the build-owned `~/.local/state/paper/`; verified in the AWS image with the `paper-bin` fix below
Commits
- `✨ feat: Add Lambda MicroVM source bundles`: the bundle + Rust lifecycle.
- `🔧 fix: autopatchelf the prebuilt paper binary`: a distinct fix kept as its own commit. The Paper release binary's interpreter is the FHS `/lib/ld-linux-*.so` path, absent from the `scratch` rootfs, so without `autoPatchelfHook` `paper`/`paperd` cannot execute and `STEREOS_START_PAPERD` silently fails. Included here rather than deferred so the bundle is not shipped with a non-functional `paperd`.
Related to PCC-765