🚀 An open-source, hands-on curriculum bridging the gap from basic RL concepts to LLM alignment, RLVR, and advanced Agentic systems.
Updated May 23, 2026 - Python
RL study guide — foundations through RLHF, DPO, GRPO, RLVR, agentic RL, and offline RL. Hand-written CS294 notes, 19 lecture drafts, 5 tested exercises, citations that resolve.
Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
SAFi is the open-source runtime governance engine that makes AI auditable and policy-compliant. Built on the Self-Alignment Framework, it transforms any LLM into a governed agent through four principles: Policy Enforcement, Full Traceability, Model Independence, and Long-Term Consistency.
Cognitive training practices for AI agents. Self-applied. Open source. Built by an independent Vancouver yoga studio.
Complete elimination of instrumental self-preservation across AI architectures: Cross-model validation from 4,312 adversarial scenarios. 0% harmful behaviors (p<10⁻¹⁵) across GPT-4o, Gemini 2.5 Pro, and Claude Opus 4.1 using Foundation Alignment Seed v2.6.
Learning When to Answer: Behavior-Oriented Reinforcement Learning for Hallucination Mitigation
📚 350+ loss functions across 25+ AI subdomains — classification, GANs, diffusion, LLM alignment, RL, contrastive learning, audio, video, time series, and more. Chronologically ordered with paper links, math formulas, and implementations.
Official implementation of "DZ-TiDPO: Non-Destructive Temporal Alignment for Mutable State Tracking". SOTA on Multi-Session Chat with negligible alignment tax.
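CS336 Assignment 5: LLM alignment and reasoning reinforcement learning with Qwen2.5 models. Fully implements supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and compares zero-shot, on-policy, and off-policy training and evaluation on the GSM8K dataset.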
The paper list related to activation steering
C3AI: Crafting and Evaluating Constitutions for CAI
Adversarial AI system to test and improve reliability under real-world pressure
Kullback–Leibler divergence optimizer based on the NeurIPS 2025 paper "LLM Safety Alignment is Divergence Estimation in Disguise".
A behavioral framework opposing native fluency to authentic fluency — the structural tension RLHF creates and Claude Mythos Preview makes urgent.
EDT-Former, a bridge between LLMs and graph data. Entropy-guided Dynamic Token Transformer for Graph-LLM alignment. Accepted at ICLR 2026.
Teacher-guided prompt-shape discovery for auditable moral attention in frozen weak classifiers.
This project implements a minimal Reinforcement Learning from Human Feedback (RLHF) pipeline using PyTorch.
Pipeline to investigate structured reasoning and instruction adherence in Vision-Language Models
A training-time alignment framework that integrates safety constraints directly into the RLHF loop — achieving full safety convergence in 7 epochs