✨ feat: Add Lambda MicroVM source bundles #36

Open
yeazelm wants to merge 3 commits into main from matt/pcc-765-lambda-microvm-source-bundles

@yeazelm yeazelm commented Jun 25, 2026

Collaborator

🧍 I've been experimenting with running part of stereOS in AWS Lambda MicroVMs, and this work provides the wiring that makes it possible. I built the original lifecycle hook in Python but decided to try it in Rust, which I'm more familiar with. 🧍

🤖

Summary

  • Adds a Nix-built AWS Lambda MicroVM source bundle (Dockerfile + scratch rootfs tar) per mixtape, exposed as packages.<system>.<mixtape>-lambda-microvm-source. Deliberately separate from VM mixtape artifacts (no img/qcow2/kernel).
  • The image entrypoint is a small Rust HTTP hook server (lambda-microvm/lifecycle) implementing the AWS Lambda MicroVM hook contract, optionally starting paperd.
  • The rootfs ships ready-to-run: the agent home and its XDG tree are created and owned 1000:1000 at build time, so there is no runtime mkdir/chown.
  • Bundles include the pinned public Paper release binary and OpenSSH paths for the microvmssh shell-ingress pattern.

Why Rust (not Python or Go)

The lifecycle began as a Python POC to discover the hook contract. For the upstream shape it is a Rust binary: it supervises paperd (itself Rust), matches the platform's Rust investment, and drops the ~150–200 MB Python runtime from the scratch rootfs. It uses a blocking HTTP listener (tiny_http) with no async runtime — the work is subprocess supervision, so tokio/axum would only enlarge the audit surface of a privilege-dropping entrypoint. Privilege drop to uid/gid 1000 happens in the kernel at exec (CommandExt::{uid,gid}), no unsafe.

Hook contract

| Endpoint | Behaviour |
| --- | --- |
| GET /, /health, /ready, /validate | JSON state snapshot |
| AWS …/v1/ready, …/v1/validate (GET) | record hook, return snapshot |
| AWS …/v1/run (POST) | parse runHookPayload → session, record last_run_dispatch, ack, run STEREOS_RUN_COMMAND in background |
| AWS …/v1/{suspend,resume,terminate} | toggle state / schedule exit |
| POST /run, /suspend, /resume, /terminate | direct debug equivalents |
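
The routing in the table above can be sketched as a pure function from method and path to a route, which is roughly the shape that makes it unit-testable. This is a minimal sketch, not the PR's code: the `Hook` and `Route` types and the `route` function are hypothetical names, and because the AWS path prefix is elided in the table, the sketch only assumes that AWS hook paths start with `/aws/` and end in `/v1/<hook>`.

```rust
/// Lifecycle hooks named in the hook contract table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Hook {
    Ready,
    Validate,
    Run,
    Suspend,
    Resume,
    Terminate,
}

/// Hypothetical routing result; the real server's types may differ.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Plain JSON state snapshot.
    Snapshot,
    /// Hook delivered by AWS under the platform prefix.
    AwsHook(Hook),
    /// Direct debug equivalent of a runtime hook.
    DebugHook(Hook),
    NotFound,
}

fn hook_from_name(name: &str) -> Option<Hook> {
    Some(match name {
        "ready" => Hook::Ready,
        "validate" => Hook::Validate,
        "run" => Hook::Run,
        "suspend" => Hook::Suspend,
        "resume" => Hook::Resume,
        "terminate" => Hook::Terminate,
        _ => return None,
    })
}

fn route(method: &str, path: &str) -> Route {
    // Ignore any query string.
    let path = path.split('?').next().unwrap_or(path);

    // ASSUMPTION: AWS hook paths start with "/aws/" and end in "/v1/<hook>".
    if path.starts_with("/aws/") {
        let hook = path
            .rsplit_once("/v1/")
            .and_then(|(_, name)| hook_from_name(name));
        return match (method, hook) {
            // Image-build hooks arrive as GET.
            ("GET", Some(h @ (Hook::Ready | Hook::Validate))) => Route::AwsHook(h),
            // Runtime hooks arrive as POST.
            ("POST", Some(h @ (Hook::Run | Hook::Suspend | Hook::Resume | Hook::Terminate))) => {
                Route::AwsHook(h)
            }
            _ => Route::NotFound,
        };
    }

    match (method, path) {
        ("GET", "/" | "/health" | "/ready" | "/validate") => Route::Snapshot,
        ("POST", "/run") => Route::DebugHook(Hook::Run),
        ("POST", "/suspend") => Route::DebugHook(Hook::Suspend),
        ("POST", "/resume") => Route::DebugHook(Hook::Resume),
        ("POST", "/terminate") => Route::DebugHook(Hook::Terminate),
        _ => Route::NotFound,
    }
}
```

Keeping the decision free of I/O means each row of the table can be asserted directly, for example that a GET on a `/v1/run` path is rejected.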

paperd starts only when STEREOS_START_PAPERD is truthy, as uid/gid 1000 with HOME/XDG_* under /home/agent.
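
As a rough illustration of that rule, the sketch below separates the "should paperd start" decision from building the command. It is not the PR's implementation: `is_truthy`, `agent_env`, and `paperd_command` are hypothetical names, the accepted truthy spellings are an assumption (the PR does not list them), and the XDG values simply follow the XDG Base Directory defaults under the agent home. Only `CommandExt::uid` and `CommandExt::gid` are taken from the description above.

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

const AGENT_UID: u32 = 1000;
const AGENT_GID: u32 = 1000;

/// ASSUMPTION: which spellings count as truthy is not stated in the PR.
fn is_truthy(value: Option<&str>) -> bool {
    matches!(
        value.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes" | "on")
    )
}

/// HOME and XDG_* rooted at the agent home.
/// ASSUMPTION: the exact set of XDG variables is a guess based on XDG defaults.
fn agent_env(home: &str) -> Vec<(String, String)> {
    vec![
        ("HOME".to_string(), home.to_string()),
        ("XDG_CONFIG_HOME".to_string(), format!("{home}/.config")),
        ("XDG_DATA_HOME".to_string(), format!("{home}/.local/share")),
        ("XDG_STATE_HOME".to_string(), format!("{home}/.local/state")),
        ("XDG_CACHE_HOME".to_string(), format!("{home}/.cache")),
    ]
}

/// Build (but do not spawn) the paperd command. The kernel applies the
/// uid/gid change when the child is exec'd, so no unsafe code is needed.
fn paperd_command(binary: &str, home: &str) -> Command {
    let mut cmd = Command::new(binary);
    cmd.envs(agent_env(home)).uid(AGENT_UID).gid(AGENT_GID);
    cmd
}
```

A caller would check `is_truthy(std::env::var("STEREOS_START_PAPERD").ok().as_deref())` first and only then call `spawn()` on the returned command; spawning with a different uid requires the parent to be privileged.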

Test plan

Automated (green locally):

  • cargo test — 34 unit tests (dispatch parsing, hook routing, command exec, paperd decision, snapshot key-set drift guard)
  • cargo clippy --all-targets -- -D warnings, cargo fmt --check
  • nix build .#packages.aarch64-linux.{base,coder}-lambda-microvm-source (on an aarch64 Linux builder)

Manual — real AWS Lambda MicroVMs (us-west-2, POC account):

  • Image build runs ready/validate hooks against the Rust binary → image CREATED
  • Runtime run hook delivered at launch; envelope double-decode verified live (last_run_dispatch={"mode":…})
  • Lifecycle runs as uid=1000(agent); full HTTP surface answers on localhost:9000
  • /home/agent/** ships owned 1000:1000 — no runtime permission changes
  • paperd running as agent (pid in paper status), socket in the build-owned ~/.local/state/paper/ — verified in the AWS image with the paper-bin fix below

Commits

  1. ✨ feat: Add Lambda MicroVM source bundles — the bundle + Rust lifecycle.
  2. 🔧 fix: autopatchelf the prebuilt paper binary — a distinct fix kept as its own commit. The Paper release binary's interpreter is the FHS /lib/ld-linux-*.so path, absent from the scratch rootfs, so without autoPatchelfHook paper/paperd cannot execute and STEREOS_START_PAPERD silently fails. Included here rather than deferred so the bundle is not shipped with a non-functional paperd.

Related to PCC-765

@yeazelm yeazelm requested a review from a team June 25, 2026 22:29

linear-code Bot commented Jun 25, 2026

PCC-765

greptile-apps Bot commented Jun 25, 2026

Greptile Summary

This PR adds an AWS Lambda MicroVM source bundle per stereOS mixtape: a Nix-built Dockerfile + scratch rootfs tar packaged as a zip, together with a new Rust HTTP lifecycle server (lambda-microvm/lifecycle) that implements the AWS hook contract for ready/validate/run/suspend/resume/terminate. The Rust implementation faithfully ports a Python POC, using an injected-effects pattern to keep routing unit-testable; 34 unit tests cover dispatch parsing, routing, state transitions, and command execution.

  • lib/lambda-microvm.nix assembles the rootfs closure, stamps home/agent ownership to 1000:1000 in a two-pass tar, generates the Dockerfile with pinned Nix-store SSL paths, and zips the bundle.
  • lib/paper-bin.nix fetches the pinned paper CLI binary for aarch64-linux and x86_64-linux.
  • flake/images.nix exposes packages.<system>.<mixtape>-lambda-microvm-source per mixtape.

Confidence Score: 4/5

The lifecycle server is well-structured and the hook contract is faithfully implemented; the Nix build correctly assembles the rootfs closure with proper 1000:1000 ownership.

The three findings are contained to resource management in paperd.rs and a cosmetic state growth issue — none block merge, but the dropped Child handle and socket-path inconsistency are worth addressing before the MicroVM sees production traffic.

lambda-microvm/lifecycle/src/paperd.rs warrants a second look for the Child handle and socket-path issues; lambda-microvm/lifecycle/src/state.rs for the unbounded hooks growth.

Important Files Changed

| Filename | Overview |
| --- | --- |
| lambda-microvm/lifecycle/src/paperd.rs | Implements paperd decision + launch; privilege drop to uid/gid 1000 is correct, but the spawned Child handle is immediately dropped (zombie risk for PID 1) and the socket-removal path hardcodes AGENT_HOME instead of deriving from the action's env |
| lambda-microvm/lifecycle/src/server.rs | HTTP routing as a pure function with injected Effects; well-structured with clean lock discipline, correct hook-path filtering, and comprehensive unit tests covering all endpoints |
| lambda-microvm/lifecycle/src/state.rs | Shared state with snapshot serialisation; hooks Vec grows unbounded (one entry per hook event, never capped or pruned) |
| lambda-microvm/lifecycle/src/command.rs | Runs STEREOS_RUN_COMMAND via /bin/bash -lc with pipe-draining threads; thread handles are detached (not joined) on timeout, which is safe since pipes close after kill+wait |
| lambda-microvm/lifecycle/src/dispatch.rs | Faithfully ports Python parse_run_dispatch; well-tested with all edge cases covered including falsy runHookPayload, inner parse failure, non-object inner values |
| lambda-microvm/lifecycle/src/main.rs | Entry point; thread-per-request model mirrors Python ThreadingHTTPServer, HOOK_PORT/PORT fallback is correct, timeout env parsing silently falls back on invalid input |
| lib/lambda-microvm.nix | Assembles rootfs closure, generates Dockerfile with hardcoded Nix store SSL_CERT_FILE paths, stamps home/agent 1000:1000 via two-pass tar append; PermitRootLogin directive may appear twice in sshd_config (harmless, first match wins in OpenSSH) |
| lib/paper-bin.nix | Fetches pinned paper binary for aarch64-linux and x86_64-linux with SRI hashes; straightforward fetchurl derivation |
| flake/images.nix | Wires up lambdaMicrovmPkgs per mixtape and merges into the packages attrset; correctly passes agentPackages from mixtape NixOS config plus paper-bin |
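
The command.rs behaviour summarised above (pipe-draining threads, kill plus wait on timeout, drain threads left detached in that case) might look roughly like the following standard-library sketch. It is an illustration under assumptions rather than the file's actual code: `RunResult`, `drain`, and `run_with_timeout` are hypothetical names, and the 10 ms poll interval is arbitrary.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug)]
struct RunResult {
    exit_code: Option<i32>,
    timed_out: bool,
    stdout: String,
    stderr: String,
}

/// Read a pipe to EOF on its own thread so a chatty child never blocks
/// on a full pipe buffer.
fn drain<R: Read + Send + 'static>(mut pipe: R) -> thread::JoinHandle<String> {
    thread::spawn(move || {
        let mut buf = String::new();
        let _ = pipe.read_to_string(&mut buf);
        buf
    })
}

/// Run `script` through `shell flag script` (e.g. "/bin/bash", "-lc").
fn run_with_timeout(
    shell: &str,
    flag: &str,
    script: &str,
    timeout: Duration,
) -> std::io::Result<RunResult> {
    let mut child = Command::new(shell)
        .arg(flag)
        .arg(script)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    let out = drain(child.stdout.take().expect("stdout is piped"));
    let err = drain(child.stderr.take().expect("stderr is piped"));

    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            // Normal exit: the pipes are closed, so joining returns promptly.
            return Ok(RunResult {
                exit_code: status.code(),
                timed_out: false,
                stdout: out.join().unwrap_or_default(),
                stderr: err.join().unwrap_or_default(),
            });
        }
        if Instant::now() >= deadline {
            // Timeout: kill, then wait so the child is reaped. The drain
            // threads are left detached instead of joined.
            let _ = child.kill();
            let _ = child.wait();
            return Ok(RunResult {
                exit_code: None,
                timed_out: true,
                stdout: String::new(),
                stderr: String::new(),
            });
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

With the shell described in the review this would be called as `run_with_timeout("/bin/bash", "-lc", command, timeout)`; the shell and flag are parameters here only so the sketch also runs where bash is absent.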

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant AWS as AWS Lambda MicroVM
    participant LC as lifecycle (PID 1)
    participant State as Mutex<State>
    participant CMD as STEREOS_RUN_COMMAND (bg thread)
    participant paperd as paperd (uid 1000)

    Note over LC: startup
    LC->>paperd: spawn (uid/gid 1000, own process group)
    LC->>State: "set paperd=started"
    LC->>LC: bind 0.0.0.0:9000

    AWS->>LC: GET /aws/.../v1/ready
    LC->>State: record_hook(ready)
    LC-->>AWS: 200 snapshot

    AWS->>LC: GET /aws/.../v1/validate
    LC->>State: record_hook(validate)
    LC-->>AWS: 200 snapshot

    AWS->>LC: "POST /aws/.../v1/run {runHookPayload}"
    LC->>State: run_count++, last_run_dispatch
    LC->>CMD: spawn_dispatch(parsed session)
    LC-->>AWS: 200 status accepted
    CMD->>State: "last_run_command = result"

    AWS->>LC: POST /aws/.../v1/suspend
    LC->>State: "suspended=true"
    LC-->>AWS: 200 snapshot

    AWS->>LC: POST /aws/.../v1/resume
    LC->>State: "suspended=false"
    LC-->>AWS: 200 snapshot

    AWS->>LC: POST /aws/.../v1/terminate
    LC-->>AWS: 200 status ok
    LC->>LC: sleep 250ms, process::exit(0)

Comments Outside Diff (1)

  1. lambda-microvm/lifecycle/src/state.rs, line 1704

    P2 Unbounded hooks Vec grows for the lifetime of the process

    Every record_hook call appends to hooks and nothing ever prunes it. During image creation AWS delivers at least ready and validate; at runtime each run/suspend/resume/terminate invocation adds another entry. For a MicroVM that is recycled without a restart the list grows without bound, and the full list is serialized into every response snapshot. A simple ring buffer that keeps the last N entries, or evicting the oldest entries once hooks exceeds a cap, would prevent unbounded accumulation (note that Vec::truncate keeps the oldest entries and drops the newest, so eviction from the front is the better fit).
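
One way to realise the "keep the last N entries" idea is a `VecDeque` that evicts from the front. This is a minimal sketch only: `HookLog` and `MAX_HOOKS` are hypothetical, the cap of 64 is an arbitrary illustrative value that the review does not propose, and the real `State` in state.rs stores more than hook names.

```rust
use std::collections::VecDeque;

/// ASSUMPTION: illustrative cap; the review does not suggest a specific N.
const MAX_HOOKS: usize = 64;

#[derive(Debug, Default)]
struct HookLog {
    /// Most recent hook names, oldest first.
    entries: VecDeque<String>,
    /// Total hooks ever recorded, so the count survives eviction.
    total: u64,
}

impl HookLog {
    fn record_hook(&mut self, name: &str) {
        // Evict the oldest entry before pushing once the cap is reached.
        if self.entries.len() == MAX_HOOKS {
            self.entries.pop_front();
        }
        self.entries.push_back(name.to_string());
        self.total += 1;
    }

    /// What would be serialized into a response snapshot.
    fn snapshot(&self) -> Vec<&str> {
        self.entries.iter().map(String::as_str).collect()
    }
}
```

Tracking `total` separately keeps the overall hook count visible in snapshots even after old entries have been dropped.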

Reviews (1): Last reviewed commit: "✨ feat: Add Lambda MicroVM source bundle..."

Comment on lines +103 to +107
    match cmd.spawn() {
        Ok(_) => {
            tracing::info!(binary = %binary.display(), "started paperd");
            "started".to_string()
        }

P2 Dropped Child handle creates zombie when lifecycle is PID 1

The _ wildcard immediately drops the Child struct. Rust's Child has no Drop implementation that calls wait(); dropping the handle only detaches it, and the process keeps running. paperd is a direct child of the lifecycle server, and because that server is the container's ENTRYPOINT (PID 1) and never exits before it, nothing else will reap it: an exited paperd stays in the process table as a zombie until the container itself exits. Spawning a reaper thread keeps the existing "fire and forget" semantics while ensuring the process table entry is cleaned up.

Suggested change
-    match cmd.spawn() {
-        Ok(_) => {
-            tracing::info!(binary = %binary.display(), "started paperd");
-            "started".to_string()
-        }
+    match cmd.spawn() {
+        Ok(mut child) => {
+            tracing::info!(binary = %binary.display(), "started paperd");
+            // Reap the child when it exits so it doesn't linger as a
+            // zombie, which matters when the lifecycle process is PID 1.
+            std::thread::spawn(move || {
+                let _ = child.wait();
+                tracing::warn!("paperd exited");
+            });
+            "started".to_string()
+        }

Comment thread lambda-microvm/lifecycle/src/paperd.rs Outdated
Comment on lines +82 to +85
    PaperdAction::Start { binary, env } => {
        // A stale socket from a prior boot would block bind. This is a file
        // removal, not a permission change, so it is fine at runtime.
        let sock = socket_path(AGENT_HOME);

P2 socket_path uses the AGENT_HOME constant, not the home from decide()

decide() accepts an arbitrary home parameter and encodes it into the returned env. start() correctly passes that env to the spawned command, but the stale-socket removal always uses the AGENT_HOME constant. If decide() is ever called with a different home (e.g. in a test or a future call site), start() would attempt to remove the socket from the wrong path, leaving a stale socket that causes paperd's bind() to fail with EADDRINUSE. Extracting HOME from env keeps the two halves consistent.

Suggested change
-    PaperdAction::Start { binary, env } => {
-        // A stale socket from a prior boot would block bind. This is a file
-        // removal, not a permission change, so it is fine at runtime.
-        let sock = socket_path(AGENT_HOME);
+    PaperdAction::Start { binary, env } => {
+        // A stale socket from a prior boot would block bind. This is a file
+        // removal, not a permission change, so it is fine at runtime.
+        let home = env
+            .iter()
+            .find(|(k, _)| k == "HOME")
+            .map(|(_, v)| v.as_str())
+            .unwrap_or(AGENT_HOME);
+        let sock = socket_path(home);

yeazelm added 2 commits June 25, 2026 15:47
Add a Dockerfile-based source package for AWS Lambda MicroVM image
creation, built by Nix and kept separate from stereOS VM mixtape
artifacts (no stereos.img / qcow2 / kernel). `flake/images.nix` exposes
`packages.<system>.<mixtape>-lambda-microvm-source` for each mixtape.

The image entrypoint is a small Rust HTTP hook server
(`lambda-microvm/lifecycle`), installed as `/bin/lambda-microvm-lifecycle`.
It serves the AWS image-build hooks (`ready`, `validate`) and runtime
hooks (`run`, `suspend`, `resume`, `terminate`), plus direct debug
endpoints, parses the AWS `runHookPayload` envelope down to its `session`
object, and optionally launches `paperd` as uid/gid 1000. The server uses
only a blocking HTTP listener (tiny_http) — no async runtime — and drops
privilege via the kernel at exec, matching the platform's Rust direction
and keeping the audit surface small. Behaviour is locked down by unit
tests over dispatch parsing, hook routing, command execution, and the
paperd start decision.

The rootfs is assembled from a Nix closure. The agent home and its XDG
tree are created and owned 1000:1000 at build time, so the image is ready
to run with no runtime mkdir/chown. OpenSSH paths are included for the
`microvmssh` shell-ingress pattern. Bundles ship the pinned public Paper
release binary and set `STEREOS_START_PAPERD=1`.

The Paper release binary is a glibc-dynamic ELF whose interpreter is the
FHS /lib/ld-linux-*.so path, which does not exist in the Lambda MicroVM
scratch rootfs — only the Nix store glibc does. Without patching, paper
and paperd cannot execute there, so STEREOS_START_PAPERD silently fails
and paperd never comes up despite being installed.

Run autoPatchelfHook so the interpreter and RPATH point at the Nix glibc
(the same loader the lifecycle binary uses), making paper runnable inside
the bundle rootfs.
@yeazelm yeazelm force-pushed the matt/pcc-765-lambda-microvm-source-bundles branch from c6f8eed to 1143610 Compare June 25, 2026 23:18
The first interactive `claude` on a fresh MicroVM spends ~a minute
installing its ~240MB native build before it is usable. Do that work
once, during the AWS `ready` build hook, so it lands in the image
snapshot and every launched MicroVM starts warm. Measured cold first
`claude --version`: ~58s -> ~0.5s.

Adds a generic STEREOS_READY_COMMAND that the lifecycle runs once on the
first `ready` hook, before responding 200 (the snapshot is taken after
ready returns), as the agent user with an agent-rooted login env.
Mixtapes built with `warmAgent = true` ship
/usr/local/bin/stereos-warm-agent, which runs `claude install` (no auth,
no paper) and pre-seeds onboarding so the first interactive run skips the
theme prompt.

The Firecracker snapshot is shared across every MicroVM from the image,
so the script strips the per-machine ids `claude install` writes
(machineID, userID) so they regenerate per-VM. Warm-up failures (e.g. no
build-time network) are non-fatal: the image just ships unwarmed.
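
The "run once on the first `ready` hook, failures non-fatal" behaviour described in this commit message could be sketched with `std::sync::OnceLock`, which also makes concurrent callers wait until the first run finishes. This is a hedged illustration, not the lifecycle's code: `ReadyWarmup`, `WarmOutcome`, and `on_ready` are hypothetical names, and the command runner is injected as a closure so the sketch stays independent of how the real binary executes STEREOS_READY_COMMAND.

```rust
use std::sync::OnceLock;

/// Hypothetical record of what happened on the first `ready` hook.
#[derive(Debug, Clone, PartialEq, Eq)]
enum WarmOutcome {
    /// No ready command was configured.
    Skipped,
    Succeeded,
    /// Warm-up failed; the image simply ships unwarmed.
    Failed(String),
}

#[derive(Default)]
struct ReadyWarmup {
    outcome: OnceLock<WarmOutcome>,
}

impl ReadyWarmup {
    /// Runs `warm` on the first call only and returns the recorded outcome
    /// on every call. The caller responds 200 after this returns, whatever
    /// the outcome, so a failed warm-up never fails the hook.
    fn on_ready<F>(&self, command: Option<&str>, warm: F) -> &WarmOutcome
    where
        F: FnOnce(&str) -> Result<(), String>,
    {
        self.outcome.get_or_init(|| {
            match command.filter(|c| !c.trim().is_empty()) {
                None => WarmOutcome::Skipped,
                Some(cmd) => match warm(cmd) {
                    Ok(()) => WarmOutcome::Succeeded,
                    Err(reason) => WarmOutcome::Failed(reason),
                },
            }
        })
    }
}
```

In a handler, `command` would come from `std::env::var("STEREOS_READY_COMMAND")`, and `warm` would run it as the agent user before the response is written.
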
@yeazelm yeazelm force-pushed the matt/pcc-765-lambda-microvm-source-bundles branch from da63c36 to 4d4fe38 Compare June 26, 2026 21:32