
Add homework/ — course-style problem sets with solutions (HW1 fully built; HW2-7 stubs)#8

Open
0bserver07 wants to merge 1 commit into master from claude/add-homework-series


@0bserver07
Owner

Real new learning content (not just file moves). Builds the start of a homework/ directory with course-style problem sets — Berkeley CS294 had HW1–HW5, same idea here.

What's actually in this PR

homework/hw01-mdps-and-value-iteration/ — fully built. Seven problems:

  1. Write-down-an-MDP warm-up (deterministic gridworld, slippery gridworld, chain-of-thought-as-MDP).
  2. Derive the Bellman expectation equation from the definition V^π(s) = E[G_t | S_t = s].
  3. Prove value iteration converges (the operator is a γ-contraction in max-norm; conclude existence/uniqueness of V*, geometric convergence rate).
  4. Compute V* on the gridworld in closed form — derive V*_d = -(1 - γ^(d-1))/(1 - γ) + 10·γ^(d-1) and check against d=1, d=2.
  5. Do the value-iteration coding exercise (links to exercises/01-mdps/) — with a sanity check that the printed V* matrix matches the closed form from problem 4.
  6. What γ changes: γ=1 in episodic vs. non-terminating tasks, γ=0.5 collapsing the values of distant states, and the average-reward intuition.
  7. Read Sutton & Barto (S&B) 4.3-4.4: policy iteration vs. value iteration vs. generalized policy iteration.
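
The sanity check in problem 5 (value iteration output matching the closed form from problem 4) can be sketched on a one-dimensional chain. This is a minimal illustration, not the code in exercises/01-mdps/. It assumes that d is the number of steps to the goal, that every step into a non-goal state pays -1, and that the step into the goal pays +10; that is one reward convention consistent with the formula, not necessarily the one the homework uses. The names `closed_form` and `value_iteration_chain` are hypothetical.

```python
def closed_form(d, gamma, step_reward=-1.0, goal_reward=10.0):
    """Closed-form V*_d for a state d steps from the goal (requires gamma < 1)."""
    return (step_reward * (1 - gamma ** (d - 1)) / (1 - gamma)
            + goal_reward * gamma ** (d - 1))


def value_iteration_chain(n, gamma, step_reward=-1.0, goal_reward=10.0, tol=1e-10):
    """Synchronous value iteration on a deterministic chain.

    Index d is the distance to the goal; index 0 is the terminal goal state,
    whose value stays 0. Actions: move toward the goal, or away from it
    (moving away from the far end leaves the agent in place).
    """
    v = [0.0] * (n + 1)
    while True:
        new_v = list(v)
        for d in range(1, n + 1):
            candidates = []
            for nxt in (d - 1, min(d + 1, n)):
                # Assumed reward convention: +10 on entering the goal, -1 otherwise.
                reward = goal_reward if nxt == 0 else step_reward
                candidates.append(reward + gamma * v[nxt])
            new_v[d] = max(candidates)  # Bellman optimality backup
        # Stop once the max-norm change falls below the tolerance.
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


if __name__ == "__main__":
    gamma = 0.9
    v = value_iteration_chain(5, gamma)
    for d in range(1, 6):
        print(d, round(v[d], 6), round(closed_form(d, gamma), 6))
```

Under these assumptions the two columns should agree to within the tolerance. A mismatch at d=1 or d=2 usually points to a different reward convention rather than a bug in the backup.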

Full worked solutions in solutions.md (a separate file so you try problems first). Each solution ends with a "what this teaches" line so the takeaway isn't buried in the math.

HW2–HW7 stubs — real planned content, not placeholders:

  • HW2 — Policy gradients & REINFORCE
  • HW3 — Q-learning, DQN, target networks
  • HW4 — Actor-critic & PPO
  • HW5 — Reward modeling, RLHF, DPO
  • HW6 — GRPO & RL with verifiable rewards
  • HW7 — Agentic RL or offline RL (pick one)

Each stub README says exactly what the HW will cover. If HW1's shape is right, I'll bulk-build the rest the same way.

Marked unreviewed

Per the AGENTS.md convention, an agent doesn't get to promote content to reviewed. HW1 should be flipped to reviewed after you work through every problem yourself and confirm the solutions are correct (or tell me what to fix).

Not merged — review first

Tell me: does HW1's shape work? If yes, I'll bulk-build HW2-7. If you want a different format (more code-heavy, fewer theory problems, notebooks instead, etc.), I'll pivot before scaling.

🤖 Generated with Claude Code

Builds the start of a real RL course homework series. Berkeley CS294
had HW1 through HW5; same idea here, mapped to the lecture blocks.
Each HW combines theory questions, coding (linked to existing
exercises/), and reading. Solutions in a separate file so you try
problems before looking.

This commit:

- HW1 fully built — MDPs, Bellman equations, value iteration. Seven
  problems: write-down-an-MDP warm-up, derive the Bellman expectation
  equation from the definition of value, prove value iteration is a
  γ-contraction, compute V* on the deterministic gridworld in closed
  form, do the value-iteration coding exercise, work through what γ
  changes, and read S&B 4.3-4.4 on policy iteration vs. value
  iteration. Full worked solutions, ~480 lines combined.
- HW2-HW7 directories with stub READMEs describing what each will
  cover (policy gradients, Q-learning/DQN, actor-critic/PPO,
  RLHF/DPO, GRPO/RLVR, agentic-or-offline). Real planned content,
  not placeholders.
- homework/README.md as the index, with a "why bother with paper-
  and-pencil theory" note.

All marked `unreviewed` per the convention; HW1 should be promoted to
`reviewed` after a person works through every problem and confirms
the solutions are right.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
