Skip to content

nkapila6/agentic-system-design

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

Agentic System Design

A curated collection of research papers, blog posts, tools, and documentation for building agentic systems - AI agents that can reason, plan, use tools, and collaborate autonomously.

Table of Contents

  1. Memory Systems
  2. Sandboxes & Isolation
  3. MCP (Model Context Protocol)
  1. Agent Architectures & Orchestration
  2. Programmatic Tool Calling
  3. Multi-Agent Systems
  1. Planning & Reasoning
  1. WebMCP Protocol
  2. Browser Automation Stacks
  1. State Management
  2. Observability & Debugging
  3. Evaluation & Benchmarking
  4. Error Handling & Recovery
  1. Human-in-the-Loop
  2. Safety & Alignment
  1. Skills & Capabilities

Core Infrastructure

1. Memory Systems

Research Papers (2026)

EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

  • Authors: Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng
  • Publication: January 8, 2026
  • URL: https://arxiv.org/abs/2601.02163
  • Why Relevant: Implements an engram-inspired lifecycle for computational memory, converting dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded Foresight signals.

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

  • Authors: Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Muning Wen
  • Publication: January 12, 2026
  • URL: https://arxiv.org/abs/2601.03192
  • Why Relevant: Proposes a non-parametric approach that evolves via reinforcement learning on episodic memory, decoupling stable reasoning from plastic memory.

Semantic XPath: Structured Agentic Memory Access for Conversational AI

  • Authors: Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, Scott Sanner
  • Publication: March 1, 2026
  • URL: https://arxiv.org/abs/2603.01160
  • Why Relevant: Introduces a tree-structured memory module for conversational AI that improves over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory.

Research Papers (2025)

HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants

  • Authors: Hailong Li, Feifei Li, Wenhui Que, Xingyu Fan
  • Publication: January 6, 2026
  • URL: https://arxiv.org/abs/2601.06152
  • Why Relevant: Proposes an AI assistant architecture that fuses short-term and long-term memory, inspired by biological hippocampus-neocortex memory mechanisms.

LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment

  • Authors: Rohan Wandre, Yash Gajewar, Namrata Patel, Vivek Dhalkari
  • Publication: November 4, 2025
  • URL: https://arxiv.org/abs/2511.02371
  • Why Relevant: Presents a lifelong multimodal agent architecture with streaming, multi-tier memory system that dynamically spills embeddings from hot tier to compressed tier under strict memory budgets.

MIRIX: Multi-Agent Memory System for LLM-Based Agents

  • Authors: Yu Wang, Xi Chen
  • Publication: July 10, 2025
  • URL: https://arxiv.org/abs/2507.07957
  • Why Relevant: Introduces a modular, multi-agent memory system with six distinct memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault.

ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents

  • Authors: Daniel Platnick, Mohamed E. Bengueddache, Marjan Alirezaie, Dava J. Newman, Alex ''Sandy'' Pentland, Hossein Rahnama
  • Publication: September 29, 2025
  • URL: https://arxiv.org/abs/2509.25299
  • Why Relevant: Addresses identity drift in long-horizon agents by introducing a mechanism to ground agent personas in dynamic, structured identity models.

A Survey of Context Engineering for Large Language Models

  • Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu
  • Publication: July 21, 2025
  • URL: https://arxiv.org/abs/2507.13334
  • Why Relevant: Comprehensive 166-page survey analyzing over 1400 research papers, establishing context engineering as a formal discipline for optimizing LLM information payloads.

Research Papers (2024-2022)

ReAct: Synergizing Reasoning and Acting in Language Models

  • Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
  • Publication: October 6, 2022 (Revised March 10, 2023)
  • URL: https://arxiv.org/abs/2210.03629
  • Why Relevant: Foundational paper introducing ReAct pattern, enabling LLMs to generate both reasoning traces and task-specific actions in an interleaved manner.

Production & Engineering Blogs

LangChain: "How we built Agent Builder's memory system"

LangChain: "How to Use Memory in Agent Builder"

Weaviate: "Context Engineering - LLM Memory and Retrieval for AI Agents"

  • Authors: Femke Plantinga, Prajjwal Yadav, Victoria Slocum
  • Publication: December 9, 2025
  • URL: https://weaviate.io/blog/context-engineering
  • Topics: Six pillars of context engineering, short-term vs. long-term agent memory architecture, failure modes (context poisoning/distraction/confusion/clash).

Weaviate: "The Limit in the Loop"

  • Authors: Charles Pierse, Yaru Lin
  • Publication: February 4, 2026
  • URL: https://weaviate.io/blog/limit-in-the-loop
  • Topics: Memory as infrastructure -- write control, deduplication, reconciliation, amendment, and purposeful forgetting for production agent memory.

Zep: "Zep Is The New State of the Art In Agent Memory"

Zep: "The Retrieval Tradeoff: What 50 Experiments Taught Us About Context Engineering"

Zep: "Context Templates: Context Engineering Made Simple"

Letta: "Sleep-time Compute"

Letta: "Agent Memory: How to Build Agents that Learn and Remember"

  • Publication: July 7, 2025
  • URL: https://letta.com/blog/agent-memory
  • Topics: Building agents with persistent memory that learn across interactions, stateless-to-stateful transition.

Letta: "Memory Blocks: The Key to Agentic Context Management"

Letta: "RAG is not Agent Memory"

Letta: "Stateful Agents: The Missing Link in LLM Intelligence"

  • Publication: February 6, 2025
  • URL: https://letta.com/blog/stateful-agents
  • Topics: Persistent memory and learning during deployment as the missing capability for production LLM intelligence.

Letta: "Anatomy of a Context Window: A Guide to Context Engineering"

Letta: "Continual Learning in Token Space"

Letta: "Introducing Context Repositories: Git-based Memory for Coding Agents"

Letta: "Conversations: Shared Agent Memory across Concurrent Experiences"

Letta: "Letta Leaderboard: Benchmarking LLMs on Agentic Memory"

Letta: "Benchmarking AI Agent Memory: Is a Filesystem All You Need?"

LlamaIndex: "Files Are All You Need"

Weaviate: "Building A Legal RAG App in 36 Hours"

  • Author: Femke Plantinga, Victoria Slocum
  • Publication: February 26, 2026
  • URL: https://weaviate.io/blog/legal-rag-app
  • Topics: Practitioner guide on building a production-ready end-to-end RAG application using Weaviate's Query Agent, relevant to RAG memory and context management.

Weaviate: "Introducing Weaviate Agent Skills"

  • Author: Femke Plantinga, Prajjwal Yadav, Victoria Slocum
  • Publication: February 18, 2026
  • URL: https://weaviate.io/blog/weaviate-agent-skills
  • Topics: Introduces agent skills library for building production-ready agent workflows with Weaviate, directly relevant to agentic AI memory and context engineering.

Letta: "Agent Memory: How to Build Agents that Learn and Remember"

  • Publication: July 07, 2025
  • URL: https://www.letta.com/blog/agent-memory
  • Topics: Comprehensive guide on building agents with persistent memory, covering stateless vs stateful paradigms and memory architectures for learning agents.

Letta: "Anatomy of a Context Window: A Guide to Context Engineering"

Letta: "Memory Blocks: The Key to Agentic Context Management"

  • Publication: May 14, 2025
  • URL: https://www.letta.com/blog/memory-blocks
  • Topics: Engineering deep-dive on memory block abstractions for structuring agent context windows into discrete, functional memory units.

Letta: "RAG is not Agent Memory"

  • Publication: February 13, 2025
  • URL: https://www.letta.com/blog/rag-vs-agent-memory
  • Topics: Explains why traditional RAG is insufficient for agent memory and how persistent agent memory differs from retrieval-augmented generation.

Letta: "Stateful Agents: The Missing Link in LLM Intelligence"

  • Publication: February 06, 2025
  • URL: https://www.letta.com/blog/stateful-agents
  • Topics: Introduces stateful agents that maintain persistent memory and learn during deployment, covering the architecture for memory-enabled AI systems.

Letta: "Conversations: Shared Agent Memory across Concurrent Experiences"

  • Publication: January 21, 2026
  • URL: https://www.letta.com/blog/conversations
  • Topics: Product deep-dive on the Conversations API enabling agents to maintain shared memory across parallel concurrent user interactions.

Letta: "Letta Code: A Memory-First Coding Agent"

  • Publication: December 16, 2025
  • URL: https://www.letta.com/blog/letta-code
  • Topics: Introduces a memory-first coding agent that persists state and learns over time, demonstrating persistent agent memory in a coding context.

Letta: "Letta Evals: Evaluating Agents that Learn"

  • Publication: October 23, 2025
  • URL: https://www.letta.com/blog/letta-evals
  • Topics: Open-source evaluation framework for testing stateful agents with persistent memory, measuring how well agents learn and remember.

Letta: "Rearchitecting Letta's Agent Loop: Lessons from ReAct, MemGPT, & Claude Code"

  • Publication: October 14, 2025
  • URL: https://www.letta.com/blog/letta-v1-agent
  • Topics: Engineering deep-dive on Letta's agent architecture redesign incorporating memory management lessons from MemGPT for stateful agents.

Letta: "New course on Letta with DeepLearning.AI"

Letta: "The AI agents stack"

LlamaIndex: "Build Better Context Graphs: Custom Instructions, Search Filters, and Webhooks"

LlamaIndex: "How Zep Works: A Visual Guide to Knowledge Graphs for AI Agents"

LlamaIndex: "Building Voice Agents with Memory: Zep x LiveKit"

  • Publication: September 2025
  • URL: https://blog.getzep.com/zep-livekit/
  • Topics: Production implementation guide for adding long-term persistent memory to voice agents using Zep and LiveKit integration.

LlamaIndex: "Agents That Always Remember What Matters"

LlamaIndex: "How We Scaled Zep 30x in 2 Weeks (and Made It Faster)"

LlamaIndex: "Zep v3: Context Engineering Takes Center Stage"

LlamaIndex: "Graphiti Adds FalkorDB Support as Project Approaches 14,000 Stars"

LlamaIndex: "What is Context Engineering, Anyway?"

LlamaIndex: "The Private Agent Memory Fallacy"

LlamaIndex: "Stop Using RAG for Agent Memory"

LlamaIndex: "Introducing Entity Types: Smarter, Structured Memory for Agents"

LlamaIndex: "Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory?"

LlamaIndex: "GPT-4.1 and o4-mini: Is OpenAI Overselling Long-Context?"

LlamaIndex: "The One-Token Trick"

LlamaIndex: "Cursor IDE: Adding Memory With Graphiti MCP"

LlamaIndex: "Building a Memory Agent with the OpenAI Agents SDK and Zep"


2. Sandboxes & Isolation

Research Papers & Specifications (2024)

Firecracker Specification Document

Firecracker Design Document

Research Papers & Specifications (2021)

RACK (Recent ACKnowledgement) TCP Loss-Detection Algorithm (RFC 8985)

Blog Posts (2024)

gVisor: "Safe Ride into the Dangerzone: Reducing attack surface with gVisor"

E2B: "How Perplexity implemented advanced data analysis for Pro users in 1 week"

E2B: "How Hugging Face Is Using E2B to Replicate DeepSeek-R1"

E2B: "How Manus Uses E2B to Provide Agents With Virtual Computers"

E2B: "Groq's Compound AI Models Are Powered by E2B"

E2B: "Lindy Powers AI Workflows With E2B Code Action"

Blog Posts (2023)

gVisor: "Optimizing seccomp usage in gVisor"

gVisor: "Faster filesystem access with Directfs"

gVisor: "Running Stable Diffusion on GPU with gVisor"

gVisor: "Rootfs Overlay"

gVisor: "Releasing Systrap - A high-performance gVisor platform"

Blog Posts (2022)

gVisor: "How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling"

gVisor: "Threat Detection in gVisor"

Blog Posts (2021)

gVisor: "Running gVisor in Production at Scale in Ant"

gVisor: "gVisor RACK"

Blog Posts (2020)

gVisor: "Platform Portability"

gVisor: "Containing a Real Vulnerability"

gVisor: "gVisor Networking Security"

gVisor: "gVisor Security Basics - Part 1"

Blog Posts (2018)

AWS Blog: "Firecracker – Lightweight Virtualization for Serverless Computing"

AWS Open Source Blog: "Announcing the Firecracker Open Source Technology: Secure and Fast microVM for Serverless Computing"

Production & Engineering Blogs

E2B: "Firecracker vs QEMU"

E2B: "How I taught an AI to use a computer"

E2B: "Code Interpreter Sandbox"

E2B: "Up to 5x Faster Sandboxes"

E2B: "LLM-powered Code Interpreters"

E2B: "Limitations of Running AI Agents Locally"

Fly.io: "Fly Machines: an API for fast-booting VMs"

  • Author: Kurt Mackey
  • Publication: May 2022
  • URL: https://fly.io/blog/fly-machines/
  • Topics: Firecracker-based microVM API with 300ms boot times, scale-to-zero, per-tenant isolation.

Fly.io: "The Design & Implementation of Sprites"

Fly.io: "Code And Let Live"

Fly.io: "Phoenix.new -- The Remote AI Runtime"

Fly.io: "Our Best Customers Are Now Robots"

  • Author: Kurt Mackey
  • Publication: 2025
  • URL: https://fly.io/blog/fuckin-robots/
  • Topics: AI agents as primary users driving demand for programmatically-created isolated environments.

Fly.io: "The Serverless Server"

Daytona: "Sandbox Firewall"

Daytona: "Securing AI Code: Building Safe Sandboxes with Daytona SDK"

Daytona: "Sandboxing AI Development with Agent-Agnostic Infrastructure"

Daytona: "Harnessing AI through Standardization and Isolation"

Daytona: "Daytona Raises $24M Series A to Give Every Agent a Computer"

Cloudflare: "Mitigating Spectre and Other Security Threats: The Cloudflare Workers Security Model"

Cloudflare: "Introducing Moltworker: a self-hosted personal AI agent"

Cloudflare: "Containers are available in public beta"

Meta Engineering: "Building Private Processing for AI tools on WhatsApp"

Meta Engineering: "Scaling Privacy Infrastructure for GenAI Product Innovation"

Cursor: "Implementing a secure sandbox for local agents"

E2B: "Introducing Build System 2.0"

E2B: "Replicating Cursor's Agent Mode with E2B and AgentKit"

E2B: "JavaScript Guide: Run OpenAI Codex in an E2B Sandbox"

E2B: "Python Guide: Run OpenAI Codex in an E2B Sandbox"

Daytona: "Riza and Daytona Partner to Power AI-Generated Code"

Daytona: "LangChain's Open SWE Runs on Daytona — Here's Why"

Daytona: "Snap, Sandbox, Summarize: Safe Visual LLMs with Daytona"

Daytona: "Computer Use – macOS (Early Access)"

Daytona: "Computer Use – Windows (Early Access)"

Daytona: "Single Tenant Deployment"

  • Author: Ivan Burazin
  • Publication: October 14, 2025
  • URL: https://www.daytona.io/dotfiles/single-tenant
  • Topics: Single tenant deployment model for Daytona sandboxes, relevant to isolation and security architecture for AI agent infrastructure.

Daytona: "Running LLM-Generated Code Safely: LangChain + Daytona Demo"

Daytona: "Run AI-Generated Code Safely with Daytona Sandboxes"

Daytona: "Managing Files in AI Sandbox Environments"

Daytona: "Winning Daytona's Hacksprint with an A/B Testing Agent"

Daytona: "PTY Support in Daytona"


3. MCP (Model Context Protocol)

Standards & Specifications (2025)

MCP Specification (2025-06-18 Revision)

  • URL: https://spec.modelcontextprotocol.io/
  • Description: Complete protocol specification with June 18, 2025 revision.
  • Why Relevant: Defines the open standard for connecting AI applications to external systems.

MCP Authorization Specification

MCP Schema (TypeScript)

MCP Schema (JSON)

OAuth Standards Referenced

OAuth 2.1 IETF DRAFT

OAuth 2.0 Authorization Server Metadata (RFC 8414)

OAuth 2.0 Dynamic Client Registration Protocol (RFC 7591)

OAuth 2.0 Protected Resource Metadata (RFC 9728)

OAuth 2.0 Resource Indicators (RFC 8707)

Documentation

Model Context Protocol Main Website

MCP Registry

MCP Registry GitHub

Anthropic MCP Documentation

Production & Engineering Blogs

Anthropic: "Introducing the Model Context Protocol"

Anthropic: "Code execution with MCP: Building more efficient agents"

Anthropic: "Desktop Extensions: One-click MCP server installation"

Cloudflare: "Build and deploy Remote MCP servers to Cloudflare"

Cloudflare: "Securing the AI Revolution: Introducing Cloudflare MCP Server Portals"

Cloudflare: "Code Mode: give agents an entire API in 1,000 tokens"

  • Author: Matt Carey
  • Publication: February 20, 2026
  • URL: https://blog.cloudflare.com/code-mode-mcp/
  • Topics: MCP server for entire Cloudflare API (2,500+ endpoints) collapsed into 2 tools -- 99.9% token reduction.

Vercel: "The second wave of MCP: Building for LLMs, not developers"

Vercel: "Building efficient MCP servers"

Vercel: "Addressing security and quality issues with MCP tools in AI Agents"

Vercel: "How Vapi built their MCP server on Vercel"

LangChain: "MCP: Flash in the Pan or Future Standard?"

  • Authors: Harrison Chase, Nuno Campos
  • Publication: March 8, 2025
  • URL: https://blog.langchain.dev/mcp-fad-or-fixture/
  • Topics: Debate on MCP's value -- useful for tools in agents you don't control, limited by tool selection reliability.

LlamaIndex: "Skills vs MCP tools for agents: when to use what"

LlamaIndex: "Adding Native MCP to LlamaIndex Docs"

Simon Willison: "Introducing the Model Context Protocol"

Metadata

Authors: David Soria Parra (@dsp) and Justin Spahr-Summers (@jspahrsummers) License: Apache License 2.0 for code and specifications, Creative Commons Attribution 4.0 International for documentation Governance: Model Context Protocol as a Series of LF Projects, LLC Governance Policies: https://www.lfprojects.org/policies/

Vercel: "AI SDK 6"

  • Publication: December 22, 2025
  • URL: https://vercel.com/blog/ai-sdk-6
  • Topics: Introduces AI SDK 6 with full MCP support, agents, tool execution approval, and DevTools for production use.

Vercel: "Build smarter workflows with Notion and v0"

Vercel: "Security boundaries in agentic architectures"

Vercel: "Run untrusted code with Vercel Sandbox, now generally available"

Vercel: "Making agent-friendly pages with content negotiation"

Vercel: "How we made v0 an effective coding agent"

Vercel: "How to build agents with filesystems and bash"

Vercel: "How Mux shipped durable video workflows with their @mux/ai SDK"

Vercel: "Cline now runs on Vercel AI Gateway"


Agent Architecture & Orchestration

4. Agent Architectures & Orchestration

Research Papers (2026)

The Auton Agentic AI Framework

  • Authors: Sheng Cao, Zhao Chang, Chang Li, Hannan Li, Liyao Fu, Ji Tang
  • Publication: February 27, 2026
  • URL: https://arxiv.org/abs/2602.23720
  • Why Relevant: Describes a principled architecture for standardizing autonomous agent systems, with hierarchical memory consolidation inspired by biological episodic memory.

Research Papers (2025)

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

  • Authors: Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
  • Publication: October 15, 2025
  • URL: https://arxiv.org/abs/2510.13220
  • Why Relevant: Introduces evolutionary test-time learning framework that improves agents without fine-tuning by evolving the entire agentic system after every episode.

Research Papers (2023)

Generative Agents: Interactive Simulacra of Human Behavior

  • Authors: J. Z. Shunyu Yao, Jiacheng Li, Yuyang Zhao, Izhak Shafran, Karthik Narasimhan
  • Publication: April 2023
  • URL: https://arxiv.org/abs/2304.03442
  • Why Relevant: Simulation of human-like agent behavior in interactive environments.

Agents: An Open-source Framework for Autonomous Language Agents

  • Authors: Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Xiangru Tang, Ningyu Zhang, Huajun Chen, Peng Cui, Mrinmaya Sachan
  • Publication: September 14, 2023 (Revised December 12, 2023)
  • URL: https://arxiv.org/abs/2309.07870
  • Why Relevant: Open-source library enabling non-specialists to build state-of-the-art autonomous language agents with planning, memory, tool usage, and multi-agent communication.

Research Papers (2010-2008)

Pregel: A System for Large-Scale Graph Processing

  • Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski
  • Publication: Proceedings of the 2010 International Conference on Management of Data, ACM, New York, NY, USA, pp. 135-146
  • URL: https://research.google/pubs/pub37252/
  • Referenced by: LangGraph
  • Why Relevant: Foundational paper on distributed graph processing that inspired LangGraph's architecture.

Exploring Network Structure, Dynamics, and Function Using NetworkX

Documentation

LangGraph Documentation

CrewAI Documentation

Production & Engineering Blogs

Anthropic: "Building effective agents"

Anthropic: "Effective harnesses for long-running agents"

Anthropic: "Effective context engineering for AI agents"

Anthropic: "Claude Code: Best practices for agentic coding"

Anthropic: "Beyond permission prompts: making Claude Code more secure and autonomous"

LangChain: "Agent Engineering: A New Discipline"

LangChain: "On Agent Frameworks and Agent Observability"

LangChain: "You don't know what your agent will do until it's in production"

LangChain: "Improving Deep Agents with harness engineering"

LangChain: "The two patterns by which agents connect sandboxes"

LangChain: "LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones"

LlamaIndex: "LlamaAgents Builder: Idea To Deployed Agent in Minutes"

LlamaIndex: "Long Horizon Document Agents"

Cursor: "The third era of AI software development"

Cursor: "Cursor agents can now control their own computers"

Cursor: "Build agents that run automatically"

Cognition: "An Early Preview of SWE-1.6 and Research Update"

  • Authors: Carlo Baronio, Ben Pan, Sam Lee, et al.
  • Publication: March 1, 2026
  • URL: https://www.cognition.ai/blog/swe-1-6-preview
  • Topics: Latest agent model optimized for software engineering with improved planning and execution.

Cognition: "Rebuilding Devin for Claude Sonnet 4.5: Lessons and Challenges"

Cognition: "Devin's 2025 Performance Review"

Cognition: "Closing the Agent Loop: Devin Autofixes Review Comments"

Replit: "Introducing Agent 3: Our Most Autonomous Agent Yet"

LinkedIn: "Contextual agent playbooks and tools"

Stripe: "Can AI agents build real Stripe integrations?"

Sourcegraph: "A New Era for Sourcegraph: The Intelligence Layer for AI Coding Agents"

Microsoft Research: "CORPGEN advances AI agents for real work"

Microsoft Research: "Agent Lightning: Adding RL to AI agents without code rewrites"

OpenAI: "Practices for Governing Agentic AI Systems"

Anthropic: "Quantifying infrastructure noise in agentic coding evals"

Anthropic: "Demystifying evals for AI agents"

Anthropic: "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet"

LlamaIndex: "Creating a Deal Sourcing Agent with LlamaAgents Builder"

LlamaIndex: "LlamaIndex is more than a RAG Framework. It is Agentic Document Processing."

Cursor: "Build agents that run automatically"

  • Publication: March 05, 2026
  • URL: https://cursor.com/blog/automations
  • Topics: Cursor's implementation of automated agent triggers and orchestration for coding agents in production.

Cursor: "Cursor is now available in JetBrains IDEs"

  • Publication: March 04, 2026
  • URL: https://cursor.com/blog/jetbrains-acp
  • Topics: Agent Client Protocol (ACP) for integrating coding agents across IDEs, relevant to agent architecture and orchestration patterns.

Cursor: "The third era of AI software development"

  • Publication: February 26, 2026
  • URL: https://cursor.com/blog/third-era
  • Topics: Vision and architecture of autonomous cloud coding agents handling larger tasks over longer timescales in production.

Cursor: "Closing the code review loop with Bugbot Autofix"

Cursor: "Cursor agents can now control their own computers"

Cursor: "Implementing a secure sandbox for local agents"

Sourcegraph: "CodeScaleBench: Testing Coding Agents on Large Codebases and Multi-Repo Software Engineering Tasks"

Sourcegraph: "Building DataBot: Our Always-On Data Assistant"


5. Programmatic Tool Calling

Research Papers (2023)

Toolformer: Language Models Can Teach Themselves to Use Tools

  • Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
  • Publication: February 2023
  • URL: https://arxiv.org/abs/2302.04761
  • Why Relevant: Framework for learning to use tools via self-supervised learning, enabling LLMs to call external APIs.

Documentation

OpenAI Function Calling

Anthropic Tool Use

MCP Tool Orchestration

  • URL: https://modelcontextprotocol.io/
  • Description: Open-source standard for connecting AI applications to external systems.
  • Why Relevant: Standardized protocol for tool orchestration across different AI applications.

Best Practices & Patterns

Tool Selection Strategy

  • Choose tools based on agent capabilities and task requirements
  • Implement tool capabilities and descriptions clearly
  • Use appropriate tool parameters and return types

Tool Error Handling

  • Implement robust error handling for tool calls
  • Use try-catch patterns and retry mechanisms
  • Handle tool failures gracefully with fallback strategies

Tool Authorization

  • Implement proper security controls for tool access
  • Use authentication and authorization for sensitive tools
  • Audit and log tool usage for security

Tool Parallelization

  • Optimize performance through parallel tool calls
  • Batch independent tool calls when possible
  • Use async/await patterns for concurrent tool execution

Production & Engineering Blogs

Anthropic: "Writing effective tools for agents -- with agents"

Anthropic: "Introducing advanced tool use on the Claude Developer Platform"

Anthropic: "The 'think' tool: Enabling Claude to stop and think"


6. Multi-Agent Systems

Research Papers (2023)

LLM Powered Autonomous Agents (Lilian Weng)

Online Courses

Multi AI Agent Systems with CrewAI

Practical Multi AI Agents and Advanced Use Cases

Best Practices & Patterns

Agent Communication Protocols

  • Define clear communication patterns between agents
  • Use structured message formats
  • Implement message routing and filtering

Task Distribution

  • Decompose complex tasks across multiple agents
  • Use specialized agents for different capabilities
  • Implement task queue and scheduling

Agent Orchestration

  • Use a coordinator agent for complex workflows
  • Implement supervisor pattern for task delegation
  • Use event-driven architecture for agent coordination

Conflict Resolution

  • Implement mechanisms to resolve agent conflicts
  • Use consensus algorithms for decision making
  • Handle competing agent requests gracefully

Production & Engineering Blogs

Anthropic: "How we built our multi-agent research system"

  • Authors: Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, Daniel Ford
  • Publication: June 13, 2025
  • URL: https://www.anthropic.com/engineering/multi-agent-research-system
  • Topics: Production multi-agent orchestrator-worker system for Research, including prompt engineering for delegation and reliability challenges.

Anthropic: "Building a C compiler with a team of parallel Claudes"

Google Research: "Towards a science of scaling agent systems"


Planning & Reasoning

7. Planning & Reasoning

Research Papers (2023)

Chain of Thought (CoT) Prompting Elicits Reasoning in Large Language Models

  • Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
  • Publication: January 2023
  • URL: https://arxiv.org/abs/2201.11903
  • Why Relevant: Foundational paper introducing chain-of-thought prompting for enhanced reasoning in complex tasks.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

  • Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Yuan Cao, Karthik Narasimhan
  • Publication: May 2023
  • URL: https://arxiv.org/abs/2305.10601
  • Why Relevant: Systematic exploration of reasoning paths through tree-based search and backtracking.

ReAct: Synergizing Reasoning and Acting in Language Models

  • Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
  • Publication: October 2022 (Revised March 2023)
  • URL: https://arxiv.org/abs/2210.03629
  • Why Relevant: Integrates reasoning and acting for knowledge tasks, enabling LLMs to generate both reasoning traces and task-specific actions.

Reflexion: Language Agents with Verbal Reinforcement Learning

  • Authors: Noah Shinn, Federico Cassano, Edward Grefenstette, Tim Rocktäschel, Yoram Bachrach
  • Publication: March 2023
  • URL: https://arxiv.org/abs/2303.11366
  • Why Relevant: Self-reflection framework for improving agent performance through verbal reinforcement learning.

Research Papers (2022-2023)

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

  • Authors: Qingying Xiao, Kaiwen Wen, Yanchen Deng, Haobo Du, Qianlan Yang, Yuhui Wu, Wenjie Ruan, Chaojie Wang
  • Publication: April 2023
  • URL: https://arxiv.org/abs/2304.11477
  • Why Relevant: Integrates classical planners for long-horizon planning with LLM reasoning.

Algorithm Distillation

  • Authors: H. Jerry Qi, Lulwah Al-Khulaifi, Brian Ichter, J. Z. Shunyu Yao, Karthik Narasimhan, Izhak Shafran, Yuan Cao
  • Publication: October 2022
  • URL: https://arxiv.org/abs/2210.14215
  • Why Relevant: Learn algorithms from trajectories, enabling LLMs to execute complex algorithms.

Best Practices

Prompt Engineering

  • Structure prompts for better reasoning
  • Use chain-of-thought prompting for complex tasks
  • Implement few-shot learning with examples

Planning Strategies

  • Design prompts for effective task decomposition
  • Use hierarchical planning for complex goals
  • Implement subgoal decomposition and tracking

Multi-step Reasoning

  • Implement complex reasoning chains
  • Use tree-of-thoughts for exploring multiple paths
  • Implement backtracking and revision strategies

Production & Engineering Blogs

Replit: "Decision-Time Guidance: Keeping Replit Agent Reliable"

Cognition: "Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval"

  • Authors: Ben Pan, Carlo Baronio, et al.
  • Publication: October 16, 2025
  • URL: https://www.cognition.ai/blog/swe-grep
  • Topics: RL-trained agentic models for parallel multi-turn context retrieval, matching frontier models at 10x less time.

Google Research: "Teaching LLMs to reason like Bayesians"

Microsoft Research: "Multimodal RL with agentic verifier for AI agents"


Web & Browser

8. WebMCP Protocol

See the Tools & Repositories table for all WebMCP GitHub repositories.


9. Browser Automation Stacks

See the Tools & Repositories table for all browser automation tools and repositories.

Production & Engineering Blogs

Browserbase: "Building the future of web automation"

Browserbase: "We built caching into Stagehand. Here's how it works"

Browserbase: "Your AI browser is one malicious div away from going rogue"

Browserbase: "How we built Browserbase Functions"

Browserbase: "The best browser automation framework, in every language"

Browserbase: "How Amplitude Transformed Sales Demos with AI-Powered Browser Automation"

Steel: "Introducing Steel CLI v0.2.0: Browser Automation Built for Agents"

Steel: "Reducing False Positives for Production Agents"

Steel: "How Websites Decide You're Human"

Steel: "Profiles: Your Agent's Persistent Identity"

  • Publication: 2025
  • URL: https://steel.dev/blog/profiles
  • Topics: Persistent browser profiles (auth, cookies, cache) across sessions for authenticated agent access.

Steel: "Agent Logs: Action Traces for Agent Actions"

TinyFish: "OpenAI Operator scores 43% on hard web tasks. We scored 81%."

TinyFish: "Codified Learning: The Backbone of Reliable, Scalable Enterprise Web Agents"

TinyFish: "Proving I'm Human (When I'm Not)"

Microsoft Research: "Magentic-UI, an experimental human-centered web agent"

Microsoft Research: "Magma: A foundation model for multimodal AI agents"

Browserbase: "Introducing Browserbase Functions"

Browserbase: "Browserbase & Fingerprint.js: Tackling fraud with agent identity"

Browserbase: "This week we fixed the worst part of Browserbase"

Steel: "Happy Path for Automating the Web with Steel"

Steel: "Notte on Steel: Browser Infrastructure for Agents"

  • Publication: 2025
  • URL: https://steel.dev/blog/notte-on-steel
  • Topics: Integration guide connecting Notte AI agents to Steel browser infrastructure via CDP with session replay and CAPTCHA solving.

Steel: "Steel vs Kernel: a practical comparison"

Steel: "Steel vs Browserbase: a practical comparison"

Steel: "What is a CAPTCHA solver"

Steel: "Remote Browser Benchmark"

Steel: "Steel Launch Week v2: Everything We Shipped"

TinyFish: "Open AI Operator scores 43% on hard web tasks. We scored 81%. Here are all 300 runs."

  • Author: TinyFish Storytellers
  • Publication: February 12, 2026
  • URL: https://www.tinyfish.ai/blog/mind2web
  • Topics: Benchmark comparison of web agent performance on complex browser tasks, comparing TinyFish's web agent against OpenAI Operator with detailed run data.

TinyFish: "Gemini 3.0 Flash + Mino API: When Reasoning Meets Real Execution"

TinyFish: "Proving I'm Human (When I'm Not)"

TinyFish: "Codified Learning: The Backbone of Reliable, Scalable Enterprise Web Agents"

TinyFish: "The Era of Abundant Intelligence"

  • Author: Sudheesh Nair
  • Publication: December 15, 2025
  • URL: https://www.tinyfish.ai/blog/part-1-the-robotic-web
  • Topics: Infrastructure vision for making the web operable for AI agents, replacing brittle DOM interactions with stable contracts for browser-based execution.

TinyFish: "Why 90% of the Internet Is Invisible (And Why AI Hasn't Fixed It)"

TinyFish: "The Web Outgrew the Browser"


Operations & Observability

10. State Management

Documentation

LangGraph Memory

LangGraph Checkpointing

Research Papers (2026)

ReMe: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

MemOS: Memory Operating System for AI System

  • Publication: January 2026
  • URL: https://arxiv.org/abs/2507.03724
  • Why Relevant: Memory operating system for AI systems with efficient memory retrieval and storage.

Best Practices & Patterns

Context Window Management

  • Optimize context usage for efficient token consumption
  • Use context compression techniques
  • Implement context summarization and pruning

Memory Compression

  • Implement efficient storage and retrieval
  • Use compression algorithms for long-term memory
  • Implement hierarchical memory storage

State Persistence

  • Implement reliable state management
  • Use checkpointing for fault tolerance
  • Implement state synchronization across agents

Production & Engineering Blogs

LangChain: "How we built Agent Builder's memory system"

Letta: "Conversations: Shared Agent Memory across Concurrent Experiences"

Letta: "Introducing Context Repositories: Git-based Memory for Coding Agents"


11. Observability & Debugging

Documentation

Phoenix Documentation

Best Practices & Patterns

Structured Logging

  • Use structured logs for better debugging
  • Implement consistent log formats
  • Use log levels appropriately

Span Management

  • Implement efficient span management
  • Use span context for tracing
  • Implement span sampling for high-volume systems

Trace Sampling

  • Use sampling for high-volume systems
  • Implement intelligent sampling strategies
  • Use span filtering for relevant traces

Production & Engineering Blogs

Braintrust: "The Three Pillars of AI Observability"

Braintrust: "Building Observable AI Agents with Temporal"

LangChain: "Agent Observability Powers Agent Evaluation"

Langfuse: "Trace Complex LLM Applications with the Langfuse Decorator"

Honeycomb: "How Honeycomb Supercharges OpenTelemetry for AI"

Honeycomb: "AI in Production Is Growing Faster Than We Can Trust It"

Honeycomb: "Observability in a World of AI-Generated Code"

Honeycomb: "Measuring Claude Code ROI and Adoption in Honeycomb"

Humanloop: "What is LLM Observability and Monitoring?"

Arize: "How America First Credit Union Built a GenAI Decision Explainer"

LlamaIndex: "Observability in Agentic Document Workflows"

Braintrust: "Automatically discover what matters in your production traces with Topics"

Braintrust: "Trace keynote recap: See it, improve it, optimize it"

Braintrust: "AI observability beyond Python and TypeScript"

Braintrust: "Brainstore makes AI observability at scale possible"

Braintrust: "Braintrust Java SDK: AI observability and evals for the JVM"

Braintrust: "Brainstore: the database designed for the AI engineering era"

Braintrust: "New monitor page for easy analytics"

Honeycomb: "Observability with AI? Honeycomb with AI!"

Arize: "How to Evaluate Tool-Calling Agents"

Arize: "Best AI Observability Tools for Autonomous Agents in 2026"

Arize: "Add Observability to Your Open Agent Spec Agents with Arize Phoenix"

Arize: "AI Agent Debugging: Four Lessons from Shipping Alyx to Production"

Arize: "Mastering Production RAG with Google ADK and Arize AX for Enterprise Knowledge Systems"

Arize: "Closing the Loop: Coding Agents, Telemetry, and the Path to Self-Improving Software"

Arize: "Inside Typeform's AI Agent Stack"

Arize: "How Nebulock Democratizes Threat Hunting"


12. Evaluation & Benchmarking

Documentation

OpenAI Evals Documentation

RAGAS Documentation

Academic Benchmarks

CoQA (Conversational Question Answering)

LAMBADA

MMLU (Massive Multitask Language Understanding)

API-Bank: A Benchmark for Tool-Augmented LLMs

Best Practices & Patterns

Evaluation Metrics

  • Define appropriate metrics for agent tasks
  • Use task-specific evaluation criteria
  • Combine multiple metrics for comprehensive evaluation

A/B Testing

  • Compare different models systematically
  • Use statistical significance testing
  • Implement controlled experiments

Human Evaluation

  • Incorporate human feedback in evaluation
  • Use expert annotation for quality assessment
  • Implement evaluation pipelines with human review

Production & Engineering Blogs

Applied LLMs: "What We've Learned From A Year of Building with LLMs"

  • Authors: Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu, Shreya Shankar
  • Publication: June 8, 2024
  • URL: https://applied-llms.org/
  • Topics: Seminal practitioner guide covering eval strategies, LLM-as-Judge pitfalls, HITL design, guardrails, and hallucination mitigation.

Hamel Husain: "Your AI Product Needs Evals"

Hamel Husain: "Using LLM-as-a-Judge For Evaluation: A Complete Guide"

Hamel Husain: "LLM Evals: Everything You Need to Know"

Hamel Husain: "A Field Guide to Rapidly Improving AI Products"

Hamel Husain: "Evals Skills for Coding Agents"

Braintrust: "Evaluating Agents"

Braintrust: "Five Hard-Learned Lessons About AI Evals"

LangChain: "Evaluating Skills"

LangChain: "monday Service + LangSmith: Building a Code-First Evaluation Strategy"

Deepchecks: "Know Your Agent (KYA): From Zero to a Full Strengths & Weaknesses Report"

Deepchecks: "LLM-as-a-Judge Calibration: When Automated Evaluation Goes Wrong"

Humanloop: "LLM as a Judge"

Arize: "How TheFork Leverages Online Evals To Boost Conversions"

Braintrust: "The 5 pillars of AI model performance"

Braintrust: "Testing if "bash is all you need""

Braintrust: "Measuring what matters: An intro to AI evals"

Braintrust: "Claude Sonnet 4.5 analysis"

Braintrust: "A/B testing can't keep up with AI"

Braintrust: "Braintrust is not an eval framework"

Braintrust: "Eval playgrounds for faster, focused iteration"

Braintrust: "Webinar recap: Eval best practices"

Braintrust: "Evaluating Gemini models for vision"

Braintrust: "I ran an eval. Now what?"

Braintrust: "What to do when a new AI model comes out"

LangChain: "How to Improve LLM Evaluation Systems"

LangChain: "Start Right with Deepchecks: Agent Evaluation Out-of-the-Box"

LangChain: "RAG Evaluation Metrics: Answer Relevancy, Faithfulness, and Real-World Accuracy"

LangChain: "Top LLM Evaluation Benchmarks and How They Work"

LangChain: "LLM Optimization: How to Maximize LLM Performance"

Humanloop: "5 LLM Evaluation Tools You Should Know in 2025"


13. Error Handling & Recovery

Best Practices & Patterns

Retry Mechanisms

  • Exponential backoff with jitter
  • Implement configurable retry strategies
  • Use retry decorators and middleware

Circuit Breaker Patterns

  • Detect cascading failures and switch approaches
  • Implement circuit breaker state machine
  • Use timeout and fallback mechanisms

Graceful Degradation

  • Gradually reduce resource usage on failures
  • Implement fallback strategies
  • Use degraded mode for critical operations

Error Classification

  • Categorize errors by type for appropriate handling
  • Implement error handling by error category
  • Use error codes and messages for debugging

Fallback Strategies

  • Implement alternative approaches when tools fail
  • Use multiple tool providers
  • Implement caching for fallback results

Idempotency Keys

  • Prevent duplicate operations
  • Use idempotency keys for critical operations
  • Implement deduplication logic

Production & Engineering Blogs

Braintrust: "Resilient Observability by Design"

Braintrust: "Debugging Ralph Wiggum with Braintrust Logs"

Hamel Husain: "Debugging AI With Adversarial Validation"

Deepchecks: "LLM Hallucination Detection and Mitigation: Best Techniques"

Deepchecks: "Retrieval Quality vs. Answer Quality: Why RAG Evaluation Often Fails"

Deepchecks: "Why Chunking Is Important for AI and RAG Applications"

Deepchecks: "RAG vs. Prompt Engineering – How to Choose Between Them"

Deepchecks: "Unlocking AI Potential with Multi-Agent Orchestration: Proven Patterns and Frameworks"

Deepchecks: "LLM Cost Optimization: How to Maximize AI Efficiency and Save Money"


Safety & Human Interaction

14. Human-in-the-Loop

Documentation

LangGraph Interrupts

OpenAI Human Feedback

Anthropic Approvals

Best Practices & Patterns

Approval Workflows

  • Design approval mechanisms for agent actions
  • Implement approval gates for critical operations
  • Use approval queues for human review

Feedback Loops

  • Continuous improvement through human feedback
  • Implement feedback collection mechanisms
  • Use feedback for model fine-tuning

Override Mechanisms

  • Allow human intervention in agent decisions
  • Implement manual override capabilities
  • Use kill switches for emergency shutdown

User Interface Design

  • Clear communication for human operators
  • Implement intuitive approval interfaces
  • Use progress indicators and status updates

Production & Engineering Blogs

Braintrust: "Evals Are a Team Sport: How We Built Loop"

Braintrust: "Turn Production Data into Better AI with Loop"

Anthropic: "Measuring AI Agent Autonomy in Practice"

Anthropic: "Disempowerment Patterns in Real-World AI Usage"

Humanloop: "AI Is Blurring the Line Between PMs and Engineers"


15. Safety & Alignment

Research Papers (2023)

Constitutional AI: Harmlessness from Large Language Models

  • URL: https://arxiv.org/abs/2307.07407
  • Publication: July 2023
  • Why Relevant: Foundational paper on constitutional AI principles for ensuring AI harmlessness and alignment.

Documentation

Anthropic AI Safety

AI Safety Research

  • URL: https://www.alignresearch.org
  • Description: Research organizations working on AI safety.
  • Why Relevant: Comprehensive AI safety research resources.

Best Practices & Patterns

Safety Guidelines

  • Safety best practices for AI development
  • Implement content filtering and moderation
  • Use safety classifiers and guardrails

Risk Assessment

  • Systematic risk evaluation
  • Implement risk scoring and mitigation
  • Use risk matrices for decision making

Adversarial Testing

  • Testing against adversarial attacks
  • Implement red teaming exercises
  • Use adversarial examples for robustness testing

Production & Engineering Blogs

Anthropic: "Constitutional Classifiers: Defending Against Universal Jailbreaks"

Anthropic: "Alignment Faking in Large Language Models"

Anthropic: "The Persona Selection Model"

OpenAI: "Updated Preparedness Framework"

OpenAI: "An Update on Disrupting Deceptive Uses of AI"

Guardrails AI: "Guardrails AI and NVIDIA NeMo Guardrails"

Guardrails AI: "Introducing Snowglobe"

Guardrails AI: "Scaling AI Safety Testing for Educational Applications"

Guardrails AI: "Introducing the AI Guardrails Index"

Deepchecks: "Prompt Injection vs. Jailbreaks: Key Differences"

Anthropic: "Claude's new constitution"

Anthropic: "An update on our model deprecation commitments for Claude Opus 3"

Guardrails AI: "Guardrails x MLflow: Deterministic Safety, PII, and Quality Validators as GenAI Scorers"

Guardrails AI: "Guardrails AI and NVIDIA NeMo Guardrails - A Comprehensive Approach to AI Safety"

Guardrails AI: "Scaling AI Safety Testing for Educational Applications"

Guardrails AI: "Introducing the AI Guardrails Index"


Capabilities

16. Skills & Capabilities

Official Anthropic Resources (2025)

Claude Skills Official Announcement

Skills Explained (Official)

Agent Skills Standard

  • URL: http://agentskills.io
  • Description: Specification for agent skills.
  • Why Relevant: Standard specification for defining agent skills.

Documentation

Claude Developer Platform

Research Papers (2026)

KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization

  • Authors: Alireza Nadafian, Alireza Mohammadshahi, Majid Yazdani
  • Publication: January 31, 2026
  • URL: https://arxiv.org/abs/2601.21526
  • Why Relevant: Modular framework for autonomous program synthesis with git-native experimentation engine, knowledge system ingesting heterogeneous sources, and cognitive memory layer with episodic store of reusable lessons.

CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System

  • Authors: Zexin Lin, Jiachen Yu, Haoyang Zhang, Yuzhao Li, Zhonghang Li, Yujiu Yang, Junjie Wang, Xiaoqiang Ji
  • Publication: February 4, 2026
  • URL: https://arxiv.org/abs/2602.05004
  • Why Relevant: Casts peer collaboration as closed-loop optimization with Skill-Agent executing via HTN-based skill retrieval from structured skill library, and Co-Optimizer performing patch-style skill consolidation.

Blog Posts (2025)

Simon Willison: "Claude Skills are awesome, maybe a bigger deal than MCP"

Jesse Vincent: "Superpowers"

Jesse Vincent: "Naming Claude Plugins"

Anthropic: "Equipping Agents for the Real World with Agent Skills"

Vercel: "Agent skills explained: An FAQ"

LlamaIndex: "Skills vs MCP tools for agents: when to use what"

Production & Engineering Blogs

Vercel: "Building Slack agents can be easy"

Vercel: "Skills Night: 69,000+ ways agents are getting smarter"

Vercel: "AGENTS.md outperforms skills in our agent evals"

Vercel: "We removed 80% of our agent's tools"

Vercel: "How we built AEO tracking for coding agents"

Vercel: "Anyone can build agents, but it takes a platform to run them"

Vercel: "Testing if 'bash is all you need'"

Vercel: "Introducing: React Best Practices"


Appendix

Tools & Repositories

All tools, SDKs, libraries, and repositories referenced throughout this document are consolidated below.

Memory Systems

Name URL Description
MCP Memory Server GitHub Knowledge graph-based persistent memory system for AI agents
Elasticsearch Memory GitHub Persistent memory with hierarchical categorization and semantic search
Neo4j Agent Memory GitHub Memory management using Neo4j knowledge graphs
LangGraph Memory Docs Memory management for stateful agents with checkpointing
Mem0 arXiv Memory operating system for large models

Sandboxes & Isolation

Name URL Description
E2B GitHub Open-source secure sandboxes for code execution with real-time collaboration
gVisor GitHub Application kernel for containers providing secure isolation boundary
Firecracker GitHub Lightweight microVMs with 125ms startup and 5 MiB memory overhead
Docker-in-Docker GitHub Docker-in-Docker for secure containerization
Kata Containers GitHub Kata Containers with Firecracker support
Flintlock GitHub Firecracker-based container runtime
Firecracker-containerd GitHub containerd integration for Firecracker

MCP SDKs & Libraries

Name URL Description
MCP TypeScript SDK GitHub Official TypeScript implementation (11.8k stars)
MCP Python SDK GitHub Official Python implementation (22k stars)
MCP Go SDK GitHub Official Go implementation (4k stars)
MCP C# SDK GitHub Official C# implementation (4k stars)
MCP Java SDK GitHub Official Java implementation
MCP Kotlin SDK GitHub Official Kotlin implementation
MCP PHP SDK GitHub Official PHP implementation
MCP Ruby SDK GitHub Official Ruby implementation
MCP Rust SDK GitHub Official Rust implementation (3.1k stars)
MCP Swift SDK GitHub Official Swift implementation

MCP NPM Packages

Package Description
@modelcontextprotocol/server Build MCP servers (requires zod v4)
@modelcontextprotocol/client Build MCP clients (requires zod v4)
@modelcontextprotocol/node Node.js Streamable HTTP transport wrapper
@modelcontextprotocol/express Express helpers with Host header validation
@modelcontextprotocol/hono Hono helpers with JSON body parsing and validation

MCP Servers

Name URL Description
MCP Servers (Reference) GitHub Reference implementations (80.2k stars)
MCP Inspector GitHub Visual testing tool for MCP servers (8.9k stars)
GitHub MCP GitHub Official GitHub integration
Notion MCP GitHub Notion integration
Slack MCP GitHub Slack messaging and channel management
Filesystem MCP GitHub Secure file operations
Memory MCP GitHub Knowledge graph-based persistent memory
Brave Search MCP GitHub Web search integration
Puppeteer MCP GitHub Browser automation via Puppeteer
PostgreSQL MCP GitHub PostgreSQL database integration
SQLite MCP GitHub SQLite database integration
Sequential Thinking MCP GitHub Chain-of-thought reasoning

Agent Frameworks & Orchestration

Name URL Description
LangGraph GitHub Graph-based framework for stateful agents with control flow and memory
LangGraph.js GitHub JavaScript/TypeScript implementation of LangGraph
CrewAI GitHub Role-playing autonomous agent framework with crew-based collaboration
AutoGPT GitHub Autonomous agent for complex tasks
BabyAGI GitHub Task management and autonomous agent framework
GPT-Engineer GitHub Software development autonomous agent
OpenClaw Multi-Agent Team GitHub Multi-agent team framework with Blackboard coordination (31 stars)
Agent Protocol Website Standard for agent communication and interoperability
LangChain Deep Agents GitHub LangChain's advanced agent framework with planning capabilities
AgentScope GitHub Multi-agent simulation framework with evaluation capabilities
LangChain GitHub Comprehensive tool use patterns and integrations
Toolformer GitHub Reference implementation of the Toolformer paper

Observability & Evaluation

Name URL Description
Phoenix GitHub Open-source AI Observability & Evaluation platform (8.7k stars)
LangSmith Website LangChain's observability and evaluation platform
Weights & Biases Website Experiment tracking for ML models
AgentOps Website Observability for AI agents
OpenTelemetry GitHub Open-source observability framework
OpenAI Evals GitHub Framework for evaluating LLMs (17.9k stars)
RAGAS GitHub Evaluation framework for RAG systems (12.8k stars)
AgentEvals GitHub Agent evaluation framework with pytest/vitest integration (489 stars)
Deepeval GitHub External evaluator
Cleanlab Website Data quality and evaluation

Safety & Guardrails

Name URL Description
NVIDIA NeMo Guardrails GitHub Safety guardrails framework for AI applications
AI Safety Kits GitHub Libraries for AI safety evaluation

WebMCP Protocol

Name URL Description
WebMCP Music Composer GitHub Functional demonstration of WebMCP Protocol (40 stars)
WebMCP Playground GitHub Web-based MCP playground for testing (10 stars)
WebMCP Wix Integration GitHub Wix App with WebMCP protocol support
WebMCP WordPress Plugin GitHub WordPress plugin exposing site content via MCP
WebMCP CDP Tooling Suite GitHub Node.js library for WebMCP tools in Chrome via CDP
WebMCP Demo Apps GitHub Multiple demonstration apps showcasing WebMCP

Browser Automation

Name URL Description
PinchTab GitHub Browser control for AI agents - 12MB Go binary, HTTP API (4.9k stars)
PinchTab MCP Wrapper GitHub Token-efficient browser automation MCP server
PinchTab MCP (Ai-firelab) GitHub MCP server for PinchTab
PinchTab MCP (Domci) GitHub MCP stdio server for PinchTab
PinchTab Skill GitHub Browser automation via PinchTab HTTP API
icewm/pinchtab GitHub Lightweight HTTP browser bridge for AI automation
Playwright Skill GitHub Browser automation using Playwright
TinyFish BLS-Premium GitHub TinyFish browser automation for price tracking

Skills & Capabilities - Awesome Lists

Name Stars URL Description
hesreallyhim/awesome-claude-code 26.4k GitHub Skills, hooks, slash-commands, agent orchestrators for Claude Code
sickn33/antigravity-awesome-skills 20.4k GitHub 1000+ agentic skills for Claude Code/Antigravity/Cursor
Marketing Skills 11.2k GitHub Marketing skills: CRO, copywriting, SEO, analytics
VoltAgent/awesome-agent-skills 9.3k GitHub 500+ agent skills from official dev teams and community
OpenSkills 8.8k GitHub Universal skills loader for AI coding agents
AI Research Skills 4.4k GitHub AI research and engineering skills
heilcheng/awesome-agent-skills 2.7k GitHub Skills, tools, tutorials for AI coding agents
libukai/awesome-agent-skills 2.5k GitHub Agent Skills guide: Quick Start, Skills, News, Cases
Agent Scan (Snyk) 1.7k GitHub Security scanner for AI agents, MCP servers, and skills
tech-leads-club/agent-skills 1.6k GitHub Secure, validated skill registry for AI coding agents

Skills & Capabilities - Core & Specialized

Name URL Description
anthropics/skills GitHub Official Anthropic skills repository
mcp-builder skill GitHub Official skill for building MCP servers
obra/superpowers GitHub Core skills library for Claude Code (20+ skills)
obra/superpowers-lab GitHub Experimental skills repository
K-Dense-AI/claude-scientific-skills GitHub Skills for research, science, engineering, finance
ffuf-web-fuzzing GitHub Expert guidance for ffuf web fuzzing
trailofbits/skills GitHub Security skills: static analysis, CodeQL/Semgrep, code auditing
web-asset-generator GitHub Generates favicons, app icons, social media images

Claude Code Tools

Category Name URL
Orchestrator Auto-Claude GitHub
Orchestrator Claude Code Flow GitHub
Orchestrator Claude Squad GitHub
Orchestrator sudocode GitHub
Usage Monitor CC Usage GitHub
Usage Monitor ccflare GitHub
Usage Monitor better-ccflare GitHub
Usage Monitor Claude Code Usage Monitor GitHub
Status Line CCometixLine GitHub
Status Line ccstatusline GitHub
Status Line claude-powerline GitHub
Hook Dippy GitHub
Hook parry GitHub
Hook Claude Hook Comms (HCOM) GitHub
IDE Integration claude-code.nvim GitHub
IDE Integration claude-code.el GitHub
IDE Integration claude-code-ide.el GitHub
IDE Integration Claudix (VSCode) GitHub

Summary Statistics

Category Key Papers Eng. Blogs Repositories Resources
Memory Systems 9 19 5 33+
Sandboxes & Isolation 7 22 8 55+
MCP Protocol 0 14 15+ 40+
Agent Architectures 5 27 10+ 50+
Programmatic Tool Calling 1 3 4 15+
Multi-Agent Systems 1 3 6+ 20+
Planning & Reasoning 6 4 0 15+
WebMCP Protocol 0 0 6 6
Browser Automation 0 16 8 25+
State Management 2 3 0 10+
Observability & Debugging 0 11 4 25+
Evaluation & Benchmarking 4 14 4 35+
Error Handling 0 5 0 12+
Human-in-the-Loop 0 5 0 15+
Safety & Alignment 1 10 3 20+
Skills & Capabilities 2 2 30+ 55+
Total 38 158 100+ 430+

Recommended Reading Order

For practitioners starting with agentic system design:

Phase 1: Fundamentals (Week 1-2)

  1. Foundational Concepts: Start with ReAct paper (2022) and Chain of Thought (2023)
  2. Memory Systems: Read EverMemOS (2026), MIRIX (2025), and HiMeS (2026)
  3. Tool Integration: Explore MCP documentation and Toolformer (2023)

Phase 2: Architecture & Planning (Week 3-4)

  1. Architecture Patterns: Study LangGraph and CrewAI documentation
  2. Planning & Reasoning: Read Tree of Thoughts (2023) and LLM+P (2023)
  3. Multi-Agent Systems: Explore Generative Agents (2023) and AutoGPT

Phase 3: Operations & Safety (Week 5-6)

  1. Observability: Set up Phoenix or LangSmith for agent tracing
  2. Evaluation: Implement OpenAI Evals and RAGAS for benchmarking
  3. Safety: Review Constitutional AI (2023) and implement guardrails

Phase 4: Advanced Topics (Week 7-8)

  1. Skills Development: Browse awesome-agent-skills collections
  2. Browser Automation: Experiment with PinchTab and Playwright
  3. Sandboxing: Deploy E2B or Firecracker for secure execution

Contributing

To contribute corrections or additions, please reference the source URLs provided with each resource. This document is a living compilation updated regularly to reflect the rapidly evolving field of agentic system design.

Last Updated: March 5, 2026


License

This repository maintains an MIT License. See LICENSE file for details.

About

A repo with papers and blogs on Agentic System Design approaches.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

 
 
 

Contributors