Add homework/ — course-style problem sets with solutions (HW1 fully built; HW2-7 stubs) by 0bserver07 · Pull Request #8 · 0bserver07/Study-Reinforcement-Learning

0bserver07 · 2026-05-15T20:54:40Z

Real new learning content (not just file moves). Builds the start of a homework/ directory with course-style problem sets — Berkeley CS294 had HW1–HW5, same idea here.

What's actually in this PR

homework/hw01-mdps-and-value-iteration/ — fully built. Seven problems:

1. Write-down-an-MDP warm-up (deterministic gridworld, slippery gridworld, chain-of-thought-as-MDP).
2. Derive the Bellman expectation equation from the definition V^π(s) = E[G_t | S_t = s].
3. Prove value iteration converges (the operator is a γ-contraction in max-norm; conclude existence/uniqueness of V*, geometric convergence rate).
4. Compute V* on the gridworld in closed form: derive V*_d = -(1 - γ^(d-1))/(1 - γ) + 10·γ^(d-1) and check against d=1, d=2.
5. Do the value-iteration coding exercise (links to exercises/01-mdps/), with a sanity check that the printed V* matrix matches the closed form from problem 4.
6. What γ changes: γ=1 in episodic vs non-terminating, γ=0.5 collapsing far values, average-reward intuition.
7. Read S&B 4.3-4.4: policy iteration vs value iteration vs generalized policy iteration.
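To give reviewers a feel for the sanity check between the coding exercise and the closed form from problem 4, here is a minimal sketch. It is not the code in exercises/01-mdps/. It assumes a hypothetical 1-D corridor in which the state `d` steps from the goal pays -1 per move and +10 on the move that enters the goal, which is the reward structure the closed form suggests; the names `closed_form`, `bellman_backup`, and `value_iteration` and the size `N` are made up for illustration. It also spot-checks the γ-contraction property numerically on one random pair of value vectors, which illustrates the claim but does not prove it.

```python
import numpy as np

GAMMA = 0.9
N = 6  # farthest state is N steps from the goal (hypothetical size)


def closed_form(d, gamma=GAMMA):
    """V*_d from problem 4: (d-1) step penalties, then the discounted +10."""
    return -(1 - gamma ** (d - 1)) / (1 - gamma) + 10 * gamma ** (d - 1)


def bellman_backup(v, gamma=GAMMA):
    """Apply the Bellman optimality operator once on the assumed corridor.

    v[d] is the value of the state d steps from the goal. Index 0 is the
    terminal goal state, whose backed-up value is always 0.
    """
    new_v = np.zeros_like(v)
    last = len(v) - 1
    for d in range(1, len(v)):
        # Assumed rewards: +10 for the move into the goal, -1 otherwise.
        reward_toward = 10.0 if d == 1 else -1.0
        toward = reward_toward + gamma * v[d - 1]
        away = -1.0 + gamma * v[min(d + 1, last)]  # far wall: stay in place
        new_v[d] = max(toward, away)
    return new_v


def value_iteration(n=N, gamma=GAMMA, tol=1e-10):
    """Iterate the backup from V = 0 until the max-norm change is below tol."""
    v = np.zeros(n + 1)
    while True:
        new_v = bellman_backup(v, gamma)
        if np.max(np.abs(new_v - v)) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
for d in range(1, N + 1):
    print(f"d={d}: value iteration {v_star[d]:.4f}, closed form {closed_form(d):.4f}")

# Numerical spot check of the contraction inequality
# ||T u - T w||_inf <= gamma * ||u - w||_inf on one random pair.
rng = np.random.default_rng(0)
u, w = rng.normal(size=(2, N + 1))
lhs = np.max(np.abs(bellman_backup(u) - bellman_backup(w)))
rhs = GAMMA * np.max(np.abs(u - w))
print(f"contraction check: {lhs:.4f} <= {rhs:.4f} is {lhs <= rhs + 1e-12}")
```

Under these assumptions the two columns should agree for every `d`, and setting `GAMMA = 0.5` shows how quickly the values of distant states flatten out, which is the effect problem 6 asks about.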

Full worked solutions in solutions.md (a separate file so you try problems first). Each solution ends with a "what this teaches" line so the takeaway isn't buried in the math.

HW2–HW7 stubs — real planned content, not placeholders:

HW2 — Policy gradients & REINFORCE
HW3 — Q-learning, DQN, target networks
HW4 — Actor-critic & PPO
HW5 — Reward modeling, RLHF, DPO
HW6 — GRPO & RL with verifiable rewards
HW7 — Agentic RL or offline RL (pick one)

Each stub README says exactly what the HW will cover. If HW1's shape is right, I bulk-build the rest the same way.

Marked `unreviewed`

Per the AGENTS.md convention, an agent doesn't get to promote content to `reviewed`. HW1 should be flipped to `reviewed` after you work through every problem yourself and confirm the solutions are correct (or tell me what to fix).

Not merged — review first

Tell me: does HW1's shape work? If yes, I bulk-build HW2-7. If you want a different format (more code-heavy, fewer theory problems, notebooks instead, etc.), I pivot before scaling.

🤖 Generated with Claude Code

Builds the start of a real RL course homework series. Berkeley CS294 had HW1 through HW5; same idea here, mapped to the lecture blocks. Each HW combines theory questions, coding (linked to existing exercises/), and reading. Solutions in a separate file so you try problems before looking.

This commit:

- HW1 fully built: MDPs, Bellman equations, value iteration. Seven problems: write-down-an-MDP warm-up, derive the Bellman expectation equation from the definition of value, prove the value-iteration operator is a γ-contraction, compute V* on the deterministic gridworld in closed form, do the value-iteration coding exercise, work through what γ changes, and read S&B 4.3-4.4 on policy iteration vs. value iteration. Full worked solutions, ~480 lines combined.
- HW2-HW7 directories with stub READMEs describing what each will cover (policy gradients, Q-learning/DQN, actor-critic/PPO, RLHF/DPO, GRPO/RLVR, agentic-or-offline). Real planned content, not placeholders.
- homework/README.md as the index, with a "why bother with paper-and-pencil theory" note.

All marked `unreviewed` per the convention; HW1 should be promoted to `reviewed` after a person works through every problem and confirms the solutions are right.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

0bserver07 wants to merge 1 commit into master from claude/add-homework-series
