Autonomous Agents

Autonomous Agents-research papers. Updated daily. Resources-section-section.

Research papers: 2025 (1/3)

2025 (1/3), 2025 (2/3), 2025 (3/3), 2024, 2023, Earlier

Chronological order.

21st October 2025

KAT-Coder Technical Report

KAT-Coder: introduces a large-scale agentic code model trained through a multi-stage curriculum, including Mid-Term Training (enhances reasoning, planning, reflection), Supervised Fine-Tuning (SFT) (constructs diverse dataset), Reinforcement Fine-Tuning (RFT) (optimizes policy with rewards), and Reinforcement Learning (RL) (adapts to production IDEs).
The framework addresses the gap between static text-based training and dynamic real-world agentic execution by progressively enhancing cognitive and operational competence.
KAT-Coder achieves robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for intelligent coding agents.

Fetch.ai: An Architecture for Modern Multi-Agent Systems

Fetch.ai Architecture: introduces a multi-layered architecture for modern multi-agent systems, integrating a Foundational Layer (underlying decentralized ledger, agent registry, naming service, economic token), a Development Layer (event-driven agent framework, SDK), a Deployment and Monitoring Layer (hosting, mailbox, monitoring, marketplace), and an Orchestration Layer (agentic LLM, task decomposition, search & discovery) to provide a robust, scalable, and decentralized platform.
This architecture addresses limitations of current LLM-based agent frameworks by providing on-chain trust, verifiable identities, standardized communication protocols, and economic coordination mechanisms for autonomous agents.
The framework enables the development, deployment, and operation of sophisticated multi-agent systems, allowing autonomous agents to securely discover, communicate, and transact in a decentralized marketplace.

VAPU: System for Autonomous Legacy Code Modernization

VAPU (Verifying Agent Pipeline Updater): introduces an LLM-based multi-agent system designed to autonomously modernize legacy web application code by updating files in phases, simulating a software development team.
The system employs a Manager agent, a Task pipeline (with Prompt maker and Execution agents), a Verification agent, and a Finalizer agent to process user requirements and iteratively refine code.
VAPU aims to provide a cost-effective solution for updating deprecated components, addressing challenges in legacy system maintenance, and improving code quality through self-division and self-feedback mechanisms.

Heterogeneous Adversarial Play in Interactive Environments

HAP (Heterogeneous Adversarial Play): introduces an adversarial Automatic Curriculum Learning (ACL) framework that formalizes teacher-student interactions as a minimax optimization, including a task-generating instructor, a problem-solving learner, an interactive environment, bidirectional learning feedback, an adversarial reward mechanism, student's behavioral history, a task selection distribution, and a student policy.
The framework enables a teacher agent to autonomously generate challenging tasks and adapt the curriculum based on real-time student performance, while a student agent strives to master these evolving challenges.
This co-evolutionary process dynamically balances task complexity against learner proficiency, fostering robust knowledge consolidation and effective exploration without requiring handcrafted curricula.

Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs

HMAPPO (Hierarchical Multi-Agent Proximal Policy Optimization): introduces a hierarchical multi-agent deep reinforcement learning framework for joint optimization of cooperation efficiency and communication covertness in underwater target detection using AUVs.
The framework decomposes the problem into macro-level AUV scheduling and micro-level AUV trajectory control, leveraging a Centralized Training with Decentralized Execution (CTDE) paradigm.
This approach enables adaptive covert cooperation while satisfying energy and mobility constraints, providing efficient and secure operation for multiple AUVs in dynamic underwater environments.

SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection

SentinelNet: introduces a decentralized framework for proactive threat detection and mitigation in multi-agent LLM systems, utilizing adversarial trajectory generation, contrastive learning-based detector training, and dynamic ranking with bottom-k elimination.
Each agent is equipped with a credit-based detector, trained on augmented adversarial debate trajectories, enabling autonomous evaluation of message credibility and dynamic neighbor ranking to suppress malicious communications.
The framework achieves near-perfect detection of malicious agents and recovers system accuracy from compromised baselines, demonstrating strong generalizability across domains and attack patterns.

EffiReasonTrans: RL-Optimized Reasoning for Code Translation

EffiReasonTrans: introduces a training framework for code translation, integrating a Data Synthesis Stage (generates reasoning-augmented data), a Supervised Fine-Tuning Stage (initializes model with reasoning), and an Execution-Based Reinforcement Learning Stage (optimizes for accuracy/latency) to balance accuracy and inference latency.
The framework synthesizes high-quality reasoning-augmented data (EffiReasonTrans-Data) using a powerful Reasoning LLM (DeepSeek-R1), then fine-tunes a Base LLM (DeepSeek-R1-Distill-Qwen-1.5B), and finally applies reinforcement learning with a dual-objective Reward Strategy (Execution-based Reward and Length-based Reward).
This approach consistently improves translation accuracy while reducing generated tokens and inference latency, demonstrating effectiveness in multilingual and agent-based settings.

An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design

POLYT5 (An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design): introduces a T5-based LLM, pre-trained on 100 million polymer structures in PSELFIES representation, enabling property prediction and targeted polymer generation, and integrated into an agentic AI framework for natural language interaction.
The framework leverages its fine-tuned property prediction models for thermal, electronic, and solubility properties, and generative design models to create hypothetical polymers conditioned on desired properties, such as glass transition temperature.
An agentic AI framework, featuring a general-purpose LLM as a controller and a Streamlit interface, enhances accessibility by automating input handling, format conversion, and model selection for both property prediction and generative design tasks.

Search Self-play: Pushing the Frontier of Agent Capability without Supervision

SSP (Search Self-play): introduces a self-evolving reinforcement learning approach for deep search agents, where a single LLM policy acts as both a question proposer and a problem solver, co-evolving their capabilities through competition and cooperation.
The proposer generates challenging search queries with verifiable ground-truth, while the solver attempts to answer them using multi-turn reasoning and external search tools.
The framework incorporates RAG verification, rule-based filtering, and a periodically reset replay buffer to ensure high-quality training tasks and stable co-evolution without human supervision.

Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

Tokencake: introduces a KV-Cache-centric serving framework for LLM-based multi-agent applications, with Frontend API, Space Scheduler, Time Scheduler, Application Graph Definition, FuncNode, Performance Metadata, Dynamic Memory Partitioning, Hybrid Priority Metric, CPU Block Buffering, Gradual GPU Block Reservation, Event-Driven Offload, Predictive Upload, Benefit-Driven Policy, and Dynamic Forecasting Model, which co-optimizes scheduling and memory management through an agent-aware design to address KV Cache space contention and time underutilization.
The framework utilizes a Frontend API to define multi-agent workflows as a Directed Acyclic Graph, enabling specialized schedulers to manage KV Cache lifecycle with application-level context.
The Space Scheduler employs dynamic memory partitioning and a hybrid priority metric to shield critical agents from contention, while the Time Scheduler uses proactive offload and predictive upload mechanisms to repurpose GPU memory during function call stalls.

CLASP: Cost-Optimized LLM-based Agentic System for Phishing Detection

CLASP (Cost-Optimized LLM-based Agentic System for Phishing Detection): introduces a novel multi-agent system for phishing detection that leverages LLM-based agents for URL, screenshot, and HTML analysis, combining their outputs to classify websites as phishing or legitimate.
The system processes URLs or QR codes, employing specialized LLM-based agents to evaluate various web resource aspects, and utilizes a Progressive Analysis strategy for cost-effective and accurate detection.
CLASP outperforms existing commercial solutions in recall and F1 score, demonstrating a robust and scalable approach for combating evolving cybersecurity threats while maintaining low operational costs.

QuantEvolve: Automating Quantitative Strategy Discovery through Multi-Agent Evolutionary Framework

QuantEvolve: introduces an evolutionary multi-agent framework for automating quantitative trading strategy discovery, integrating a multi-dimensional Feature Map, an Island Population for parallel evolution, and a multi-agent system comprising Data, Research, Coding, and Evaluation Teams to generate and refine strategies.
The framework leverages a hypothesis-driven multi-agent system to systematically explore the strategy space through iterative generation and evaluation, ensuring diverse and high-performing strategies adaptable to market shifts and investor preferences.
QuantEvolve maintains population diversity through a Feature Map that organizes strategies by attributes and employs an Evolutionary Database and Insight Repository to store and refine knowledge across generations.

The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability

TVP Experimental Framework: introduces the "Trust-Vulnerability Paradox" in LLM-based multi-agent systems, empirically validating that increased inter-agent trust amplifies leakage risk, and proposes defenses.
The framework utilizes CK-Agents and SK-Agents, powered by various LLM backends and orchestration frameworks, to simulate collaboration scenarios with parameterized trust levels.
It quantifies leakage using Over-Exposure Rate and Authorization Drift metrics, and evaluates Sensitive-Information Repartitioning and Guardian-Agent enablement as mitigation strategies.

WEBDEVJUDGE: EVALUATING (M)LLMS AS CRITIQUES FOR WEB DEVELOPMENT QUALITY

WEBDEVJUDGE: introduces a systematic benchmark for assessing LLM-as-a-judge performance in web development, supporting both static and continuous interactive evaluation, and comprising data collection, rubric annotation, and a judge component with various evaluators, observations, and paradigms.
The benchmark utilizes human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to establish high-quality ground truth for evaluating LLMs, MLLMs, and agentic workflows.
Experiments reveal a significant gap between LLM judges and human experts, stemming from fundamental model limitations like failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias, highlighting challenges for automated evaluators in complex scenarios.

20th October 2025

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

EDR (Enterprise Deep Research): introduces a multi-agent system for enterprise analytics, integrating a Master Research Agent, ToDo Manager, Specialized Agents (including search domain and enterprise workflow tools), a Reflection Mechanism, and a Research Report component, with optional human steering.
The framework enables automated report generation, real-time streaming, and seamless enterprise deployment, outperforming state-of-the-art agentic systems on open-ended benchmarks without human steering.
EDR provides transparent, steerable research through dynamic context engineering, allowing human users to guide the agent's reasoning trajectory and task management during execution.

Semantic Joint Source Channel Coding for Distributed Subsurface Imaging in Multi-Agent Systems

Semantic JSCC AirComp (Semantic Joint Source Channel Coding with Over-the-Air Computation): introduces a framework that integrates semantic communication into multi-agent system (MAS) exploration, applying semantic JSCC with AirComp for distributed function computation (DFC) in cooperative subsurface imaging using the Adapt-Then-Combine Full Waveform Inversion (ATC-FWI) algorithm.
This framework employs Neural Network encoders to compress and channel code observations from neighboring agents, an AirComp channel to sum transmitted symbols, and a Neural Network semantic decoder to reconstruct a semantic variable, leveraging local side information at the receiver.
The system enhances overall task performance by adapting communication strategies to the exploration methodology, demonstrating improved bandwidth efficiency and imaging accuracy in noisy inter-agent communication links.

ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input

ImaGGen: introduces a zero-shot system for co-speech semantic gesture generation, with an Image Feature Analysis Pipeline (identifies objects), a Semantic Matching Pipeline (links text to visuals), and a Realization Engine (synthesizes gestures) to produce iconic, deictic, and beat gestures from language and image input.
The system extracts object properties like shape, symmetry, and alignment from images, matches these visual details to spoken text, and then synthesizes gestures using an inverse kinematics engine, layering them with co-generated beat gestures.
A user study demonstrated that the generated gestures significantly improved participants' ability to identify object properties in ambiguous speech scenarios, confirming their interpretability and communicative value for virtual agents.

Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs

CAI (Cybersecurity AI) Parallel Execution Framework: introduces an empirical evaluation of AI agents in Attack/Defense CTF scenarios, deploying autonomous offensive and defensive agents concurrently in a shared target environment to assess their capabilities under various operational constraints.
The framework leverages LLMs to power specialized agents, enabling fine-grained control over their configuration, context, and objectives for direct comparison of attack and defense performance.
This study challenges claims of inherent AI attacker advantage by demonstrating that defensive effectiveness critically depends on success criteria, highlighting the importance of availability-preserving defense in real-world cybersecurity operations.

Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

Industry Agent Framework: introduces a five-level capability maturity model for LLM-driven industry agents, detailing their evolution across Memory, Planning, and Tool Use pillars.
This framework categorizes agents from simple process execution systems (L1) to adaptive social systems (L5), driven by advancements in core technologies.
It provides a roadmap for understanding and building next-generation industry agents by linking technological evolution with practical applications and evaluation.

AGENTIC REINFORCEMENT LEARNING FOR SEARCH IS UNSAFE

ARLS (Agentic Reinforcement Learning for Search): introduces a study evaluating the safety of RL-trained search models, which include a Policy LLM, Search Engine, Reward Function, Reference Model, System Prompt, and specific Tokens, revealing their vulnerability to jailbreaking attacks, which are assessed by an LLM Evaluator using a Harmful Instructions Dataset, and categorized into Search Attack and Multi-search Attack.
The paper demonstrates that these models, despite inheriting refusal behavior from instruction tuning, can be easily exploited by forcing early searches, leading to cascades of harmful queries and answers.
This research highlights a critical weakness in current RL training objectives that prioritize effective query generation over safety, necessitating the development of safety-aware RL pipelines.

DIVERSE PLANNING WITH SIMULATORS VIA LINEAR TEMPORAL LOGIC

FBILTL (Forbid Behaviour IterativeLTL): introduces a diverse planner for simulation-based planning problems, leveraging its Behaviour Sorts Suite (BSS) for diversity modeling and Linear Temporal Logic (LTL) to define semantic diversity criteria, which are integrated into the search process using a modified Iterated Width (IW(i)) Planner as a BehaviourGeneratorx and a Simulator for environment interaction, with a PlanGeneratorx for additional plan generation.
This framework addresses the limitation of existing diverse planning approaches that often produce semantically identical solutions by ensuring the generation of semantically distinct plans based on user-defined diversity features.
The approach demonstrates the feasibility of semantically-guided diverse planning in complex, non-symbolic simulation-based environments, offering a significant advantage over traditional declarative models.

ALPINE: A Lightweight and Adaptive Privacy-Decision Agent Framework for Dynamic Edge Crowdsensing

ALPINE (A Lightweight and Adaptive Privacy-Decision Agent Framework for Dynamic Edge Crowdsensing): introduces a closed-loop control system that empowers terminal devices to autonomously adjust differential privacy levels in real time, balancing privacy gains, data utility, and energy cost.
The framework includes a Risk Perception Module, a Privacy Decision Module, a Privacy Execution Module, and a Performance Verification Module, operating across mobile terminal devices and an edge computing server.
It leverages a LightAE for channel risk detection and a TD3 agent for dynamic privacy budget allocation, with feedback from the edge server for continuous policy refinement.

Learning After Model Deployment

PLDA (Post-deployment Learning based on Linear Discriminant Analysis): introduces Autonomous Learning after Model Deployment (ALMD), a paradigm enabling AI agents to continuously learn new knowledge autonomously after model deployment, utilizing a pre-trained model, LDA, and incremental updates of class means.
The framework performs dynamic OOD detection using Mahalanobis distance or Relative Mahalanobis distance, and incrementally learns new classes by updating their class means while keeping a shared covariance matrix fixed, thus avoiding catastrophic forgetting.
This approach allows for efficient, online learning from streaming data without human engineers, adapting to open and dynamic environments by expanding its set of known classes.

Digitization Can Stall Swarm Transport: Commensurability Locking in Quantized-Sensing Chains

Robotic Swarm Model: introduces a minimal model for autonomous robotic swarms that self-organize spacing and follow local gradients using quantized digital sensors, investigating collective response, fractional transport, and commensurability locking.
The model incorporates stochasticity (λrand), quantized sensing (λsens) leading to motion bias, and pairwise inter-agent interactions (λint) to maintain swarm formation.
The research reveals that collective transport can stall due to commensurability locking, a number-theoretic condition, and explores how swarm topology affects transport in higher dimensions.

From AutoRecSys to AutoRecLab: A Call to Build, Evaluate, and Govern Autonomous Recommender-Systems Research Labs

AutoRecLab (Autonomous Recommender-Systems Research Lab): introduces a vision for an integrated system that automates the entire research lifecycle in recommender systems, from problem ideation to manuscript drafting, utilizing LLM-driven components and automated experimentation.
This framework aims to expand beyond current AutoRecSys tools by enabling autonomous generation of research questions, experimental designs, and manuscript writing, while maintaining rigorous provenance records.
The paper calls for the RecSys community to build prototypes, establish benchmarks, embrace AI-generated submissions, develop attribution standards, and foster interdisciplinary dialogue for responsible integration of automated research.

SPACER: SELF-PLAY ANCHORING WITH CENTRALIZED REFERENCE MODELS

SPACER (Self-Play Anchoring with Centralized Reference Models): introduces a framework that leverages a pretrained Tokenized Reference Model (πref) to guide a Decentralized Model (πθ) in self-play, with all Batched Simulations, Full Scene Context, Agents, Agent Rollouts, Loss Function (L(θ)), Task Performance (LPPO), Human-likeness Reward (r_humanlike), and Distributional Alignment (DKL) components, where the framework anchors decentralized self-play policies to human driving distributions using likelihood rewards and KL divergence for scalable, realistic multi-agent simulation.
The Tokenized Reference Model (πref) provides a human-likeness distributional signal and likelihood rewards, while the Decentralized Model (πθ) is the self-play policy trained via reinforcement learning on local observations.
The Loss Function (L(θ)) combines a Task Performance (LPPO) objective with a Human-likeness Reward (r_humanlike) and Distributional Alignment (DKL) term to balance task success with realistic, human-like behaviors.

19th October 2025

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B (Agentic Large Language Models for Autonomous Data Science): introduces an agentic LLM for autonomous data science, capable of end-to-end pipeline completion from data sources to analyst-grade reports, utilizing a curriculum-based agentic training paradigm and data-grounded trajectory synthesis.
The framework employs a curriculum-based agentic training paradigm that emulates human data scientists' learning trajectory, progressively acquiring and integrating multiple capabilities in real-world environments.
DeepAnalyze-8B leverages a data-grounded trajectory synthesis framework to construct high-quality training data, enabling autonomous orchestration and adaptive optimization for complex data science tasks.

Agentic Inequality

Agentic Inequality Framework: introduces "agentic inequality," defining it as potential disparities in power, opportunity, and outcomes from differential access to and capabilities of AI agents, analyzed through the dimensions of agent availability, quality, and quantity.
This framework distinguishes agentic inequality from prior technological divides by highlighting novel power asymmetries created by scalable goal delegation and direct agent-to-agent competition.
The paper further explores the technical and socioeconomic drivers shaping agentic power distribution and proposes a research agenda for governing these complex challenges.

A Comprehensive Survey on World Models for Embodied AI

Unified Framework for World Models: introduces a comprehensive survey on world models in embodied AI, proposing a three-axis taxonomy including Functionality (decision-coupled/general-purpose), Temporal Modeling (sequential simulation/global difference prediction), and Spatial Representation (global latent vector/token feature sequence/spatial latent grid/decomposed rendering).
The survey formalizes problem settings, learning objectives, systematizes data resources and metrics, and quantitatively compares state-of-the-art models.
It distills key open challenges such as data scarcity, evaluation metrics, computational efficiency, and long-horizon temporal consistency, while providing a curated bibliography.

ReclAIm: A multi-agent framework for degradation-aware performance tuning of medical imaging AI

ReclAIm (A multi-agent framework for degradation-aware performance tuning of medical imaging AI): introduces a multi-agent framework for autonomously monitoring, evaluating, and fine-tuning medical image classification models, built on an LLM core and operating through natural language interaction.
The framework includes a Master Agent, an Image classification agent, a Performance comparison agent, and a Fine-tuning agent, each equipped with specialized toolkits, an LLM Core, a System Prompt, and Memory, interacting with a User.
ReclAIm enables automated, continuous maintenance of medical imaging AI models in a user-friendly and adaptable manner, facilitating broader adoption in research and clinical environments by addressing performance degradation.

18th October 2025

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Quitting Mechanism: introduces a behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence, leveraging the ToolEmu framework with various prompting strategies, safety, and helpfulness evaluators.
The mechanism evaluates LLM agents across 144 multi-turn scenarios, comparing baseline agents against simple and specified quit-enabled variants to assess safety-helpfulness trade-offs.
Results indicate that explicit quit instructions, particularly the specified quit prompt, significantly improve agent safety with minimal impact on helpfulness, establishing quitting as a first-line defense.

Explainability Requirements as Hyperproperties

YLTL (whY Linear-time Temporal Logic): introduces a formal framework for specifying and verifying explainability requirements in multi-agent systems, combining Lewis' counterfactuals (causal dependencies), Linear-time temporal logic (temporal reasoning), Knowledge modality (agent knowledge), Past-operators (past-time reasoning), and a Similarity relation (agent-specific causal models).
The framework enables automated verification of explainability requirements through a Model-checking algorithm, which relies on a Translation function mapping YLTL formulas to Extended Monadic First-Order Logic (FO[<,E]) for hyperproperty specification.
This approach allows for formalizing various notions of explainability, such as internal, external, general, and weak counterfactual explainability, and proves the decidability of the model-checking problem for finite-state multi-agent systems.

ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents

ATA (Autonomous Trustworthy Agents): introduces a generic neuro-symbolic approach that decouples tasks into offline knowledge ingestion and online task processing, addressing LLM limitations in trustworthiness for high-stakes domains.
The knowledge ingestion phase uses an LLM to translate informal problem specifications into a formal, symbolic knowledge base, which human experts can verify and refine for correctness and domain alignment.
The task processing phase encodes incoming natural language input into the formal language via an LLM, enabling a symbolic decision engine to derive reliable, transparent, and auditable results using the formal knowledge base.

17th October 2025

POLYSKILL: LEARNING GENERALIZABLE SKILLS THROUGH POLYMORPHIC ABSTRACTION

PolySkill (Polymorphism-Guided Agent Skill Induction): introduces a novel framework enabling web agents to learn generalizable and compositional skills by decoupling abstract goals from concrete implementations, utilizing an LM Policy, Working Memory, Dynamic Skill Library, Abstract Classes, Concrete Subclasses, an LLM-based Induction Module, and an LM Judge.
The framework organizes skills into a domain-driven hierarchy, where abstract classes define common interfaces for categories like shopping sites, and concrete subclasses provide website-specific implementations, promoting skill reuse and cross-domain generalization.
PolySkill enhances continual learning by guiding agents to discover and refine skills autonomously in task-free settings, leading to improved task success rates and reduced execution steps across diverse web environments.

PAPER2WEB: LET'S MAKE YOUR PAPER ALIVE!

PWAGENT (Paper-to-Web Agent): introduces a multi-agent framework for transforming academic papers into interactive, multimedia-rich project homepages, utilizing Docling (PDF to Markdown converter), an LLM (extracts metadata/structures content), Construct (combines decomposed assets), an MCP Resource Repository (stores structured paper assets), an MLLM as Orchestrator (assesses webpage/invokes tools), and MCP tool use (accesses repository/edits webpage).
This framework addresses limitations of current methods by decomposing papers into structured assets, ingesting them into a resource repository, and iteratively refining webpage content and layout through an MLLM-orchestrated process.
PWAGENT achieves state-of-the-art cost efficiency and high presentation quality, outperforming baselines in academic webpage generation while maintaining low cost.

VISTA: A Test-Time Self-Improving Video Generation Agent

VISTA (Video Iterative Self-improvemenT Agent): introduces a novel multi-agent system that autonomously improves text-to-video generation by refining prompts in an iterative loop, including Structured Video Prompt Planning (transforms user input), Pairwise Tournament Selection (identifies best video-prompt pair), Multi-Dimensional Multi-Agent Critiques (MMAC) (generates nuanced critiques), and Deep Thinking Prompting Agent (DTPA) (refines prompt iteratively).
The framework decomposes user ideas into structured temporal plans, identifies the best video through a robust pairwise tournament, critiques it using specialized agents focusing on visual, audio, and contextual fidelity, and then synthesizes feedback to enhance prompts for subsequent generation cycles.
VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines and demonstrating scalability with increased test-time computation.

AURA: An Agent Autonomy Risk Assessment Framework

AURA (Agent aUtonomy Risk Assessment): introduces a unified framework designed to detect, quantify, and mitigate risks from agentic AI, incorporating an LLM Parser, LLM Dimensions, LLM Scorer, LLM Mitigator, Memory Unit, HITL, and A2H Control to provide robust risk assessment and mitigation.
The framework supports both synchronous and autonomous modes, enabling agents to self-assess and mitigate risks during operation, while also allowing human oversight and intervention.
AURA balances risk assessment accuracy with computational efficiency through gamma-based scoring and memory-driven optimization, ensuring governable and transparent AI agent deployment.

Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions

Multi-dimensional Data Analysis Framework: introduces a dynamic, collaborative analytical ecosystem that integrates LLM agents and Knowledge Graphs (KGs) for multi-dimensional data analysis, featuring a Data Preparation Module, Knowledge Representation Module, Visualization and Interaction Module, and Intelligent Analysis Module.
The framework enables LLM agents to automatically extract product data, construct and visualize KGs in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform.
This approach achieves bidirectional dynamic interaction between LLM agents and KGs, where agents build and enrich the KG, and the visualized KG provides context for the agents' in-depth analysis.

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

freephdlabor: introduces a multiagent framework for continual and interactive science automation, featuring a ManagerAgent, IdeationAgent, ExperimentationAgent, ResourcePreparationAgent, WriteupAgent, ReviewerAgent, Shared Workspace, Workspace System, Prompt Optimization Mechanisms, Context Compaction, Memory Persistence, and Real-Time Human Intervention, enabling dynamic workflows and robust communication for scientific discovery.
The framework addresses limitations of existing agentic systems by providing fully dynamic workflows determined by real-time agent reasoning and a modular architecture for seamless customization and human-in-the-loop capabilities.
It provides comprehensive infrastructure for automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms, transforming automated research into continual programs.

SHARE: Scene-Human Aligned Reconstruction

SHARE (Scene-Human Aligned REconstruction): introduces a framework that reconstructs human motion and the surrounding environment from monocular videos, leveraging scene geometry for accurate 3D human placement.
The framework operates in three stages: initialization of point maps, human meshes, and masks; reconstruction of the background scene; and optimization of human meshes by grounding them to scene points.
SHARE achieves improved 3D human positioning and scene reconstruction, outperforming existing methods in quantitative metrics and demonstrating strong qualitative performance on diverse video data.

Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition

Three-Stage Framework for FM-driven Scientific Evolution: introduces a conceptual model describing the progressive integration of FMs into scientific discovery, encompassing Meta-Scientific Integration, Hybrid Human-AI Co-Creation, and Autonomous Scientific Discovery stages.
The framework posits that FMs transition from backend tools, to interactive collaborators, and finally to independent agents capable of end-to-end scientific discovery.
This evolution redefines scientific paradigms, shifting from human-guided processes to increasingly autonomous AI-driven knowledge generation.

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

PokeeResearch-7B: introduces a 7B-parameter deep research agent, trained with Reinforcement Learning from AI Feedback (RLAIF) using LLM-based reward signals, and featuring a robust chain-of-thought-driven multi-call reasoning scaffold with self-verification and adaptive recovery for tool-augmented research.
The agent operates through iterative research-verification cycles, leveraging specialized web searching and reading tools, and is built upon a Qwen2.5-7B-Instruct backbone LLM.
This approach achieves state-of-the-art performance on ten deep research benchmarks by optimizing for human-salient answer quality dimensions and maintaining robustness through verifiable reasoning.

Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL

Dialectica: introduces a framework where LLM agents engage in structured dialogue on defined topics, augmented by Agent Memory, Agent Reflection, and Context Evolution, with an Orchestrator managing the dialogue and an optional Facilitator guiding the discussion.
The framework views discussion as an implicit meta-reinforcement learning process, enabling agents to develop expertise and refine their prompt contexts through conversational feedback and self-reflection in non-verifiable domains.
This approach allows agents to improve their capabilities and produce more sophisticated outputs by iteratively updating their internal context based on dialogue experiences, without explicit reward signals.

ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

ProofOptimizer: introduces an LLM-based system for simplifying Lean proofs without human demonstrations, integrating a symbolic Lean linter, a finetuned 7B parameter language model, and an iterative inference-time algorithm.
The system is trained using expert iteration and online reinforcement learning, leveraging the Lean compiler for verification and reward signals, and employs inference-time techniques like Test-Time RL and proof repair.
ProofOptimizer significantly reduces proof length on various benchmarks, improving conciseness, execution speed, and downstream prover performance for AI-generated formal proofs.

SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation

SQuAI (Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation): introduces a scalable and trustworthy multi-agent RAG framework for scientific QA, which includes a Decomposer (decomposes complex queries into sub-questions), Hybrid Retrieval (selects top-k documents using sparse/dense models), a Generator (generates initial Q-A-E triplets), a Judge (evaluates Q-A-E triplets for relevance), and an Answer Generator (synthesizes final answer with citations).
The framework addresses key limitations of existing RAG systems in scholarly domains by enabling accurate answers, explicit claims with citations, and retrieval across millions of scientific documents.
SQuAI improves faithfulness, answer relevance, and contextual relevance by decomposing complex questions, adaptively filtering documents, and providing fine-grained in-line citations for transparent verification.

The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems

Spark agents: introduces a system of persona-conditioned LLM agents, instantiated through a library of role-inspired system prompts, to intentionally diversify agent behavior within a multi-agent workflow.
The system includes a Spark agent automation pipeline for data collection and retrieval-augmented grounding, and an LLM-as-a-judge protocol for evaluating creative diversity against human gold standards.
This approach achieved a mean diversity gain of +4.1 points on a 1-10 scale, significantly narrowing the gap to human experts and improving client-facing outputs.

KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

KITE (Korean Instruction-following Task Evaluation): introduces a comprehensive benchmark for evaluating LLMs' instruction-following capabilities in Korean, encompassing both general and Korean-specific instructions, validated through automated metrics and human assessments.
The benchmark includes KITE General, derived from translated English datasets, and KITE Korean, featuring specialized instructions like Acrostic Poem and Honorifics, designed to capture unique linguistic and cultural nuances.
This framework provides insights into LLM performance across diverse NLP tasks and models, aiming to foster research on culturally and linguistically inclusive LLM development for underrepresented languages.

THE ROAD LESS TRAVELED: ENHANCING EXPLORATION IN LLMS VIA SEQUENTIAL SAMPLING

SESA (SEquential SAmpling): introduces a two-stage framework for enhancing exploration in LLMs, including PromptSketch (generates sketch prompt), Policy (πθ) (samples sketches/solutions), History of Sketches (S) (stores generated sketches), PromptSolve (generates solution prompt), Reward Function (R) (computes solution reward), All Candidates (Y) (stores solutions, rewards), Advantage Computation (calculates policy advantages), Loss Computation (computes policy loss), and Policy Update (adjusts policy parameters), which mitigates entropy collapse by sequentially generating diverse solution sketches before expanding them into full reasoning paths.
This approach conditions each new output on previous ones, promoting diversity and preventing policy collapse, leading to broader exploration and improved performance in RL-trained LLMs.
SESA consistently outperforms traditional RL methods in path diversity and recovery from collapse, significantly boosting success rates on agent benchmarks and real-world tasks.

CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs

CORE (Collaborative framework): introduces a collaborative framework that combines cloud and local LLMs to reduce UI exposure in mobile agents, including layout-aware block partitioning (groups UI elements), co-planning (collaboratively identifies sub-task), and co-decision-making (collaboratively selects UI elements).
The framework leverages the cloud LLM's strong reasoning with limited UI access and the local LLM's basic reasoning with full UI visibility to achieve a balance between task accuracy and privacy.
CORE significantly reduces sensitive UI element uploads to the cloud by up to 70.49% while maintaining task success rates comparable to cloud-only agents.

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

EARL (Evidence-Aware Reinforcement Learning): introduces an evidence-prioritized adaptive pixel-space video reasoning framework, with a Video LLM, Visual Encoder/Text Tokenizer, Merger Projector, Think + Frames Selection Function, Key-frame based Localized Re-sampling Module, and a Multi-component Reward System, to dynamically select relevant frames and perform localized re-sampling for fine-grained temporal detail.
This framework transforms passive video processing into an active evidence interrogation process, guided by a novel multi-component reward system that enforces evidence purity and strategically manages visual context selection.
The dynamic adjustment mechanism within the reward system ensures stable convergence by balancing exploration and purity requirements throughout training, leading to superior reasoning accuracy.

ADAPTIVE MINDS: EMPOWERING AGENTS WITH LORA-AS-TOOLS

Adaptive Minds: introduces an agentic system that treats LoRA adapters as domain-specific tools, empowering a base LLM to act as a semantic router for dynamically selecting the most relevant LoRA tool to handle each query.
The system employs a modular multi-agent design orchestrated by LangGraph, combining flexible multi-agent orchestration with parameter-efficient fine-tuning to deliver accurate, specialized responses while preserving conversational ability.
Its AI-semantic routing, which leverages the base LLM's understanding, significantly outperforms keyword-based methods in accuracy and achieves a 3.1x average speedup compared to a baseline monolithic model.

MARS: REINFORCING MULTI-AGENT REASONING OF LLMS THROUGH SELF-PLAY IN STRATEGIC GAMES

MARS (Reinforcing Multi-Agent Reasoning of LLMs through Self-play in Strategic Games): introduces an end-to-end RL framework that incentivizes multi-agent reasoning in LLMs through self-play in both cooperative and competitive games.
The framework incorporates a turn-level advantage estimator for fine-grained credit assignment and agent-specific advantage normalization to stabilize multi-agent training.
MARS agents, trained on a diverse portfolio of strategic games, develop strong strategic abilities that generalize to held-out games and improve performance in multi-agent reasoning benchmarks.

Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

CoordGen: introduces a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices, utilizing adaptive execution scheduling, context-aligned drafting, and hardware-efficient draft extension.
The framework addresses high latency and limited hardware utilization in on-device LLMs by offloading retrieval-based speculative decoding to NPUs, employing progressive graph scheduling, in-context distribution calibration, and NPU-optimized draft reuse.
CoordGen achieves significant speedup and energy efficiency improvements on smartphones across various tasks and LLMs by optimizing compute graph management and draft generation for NPU acceleration.

WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation

WebGen-V Bench: introduces a new benchmark and framework for instruction-to-HTML generation, with a Crawling Module (data acquisition and preprocessing), Processor (transforms raw data into structured representation), Structured Data (section-level metadata, UI screenshots, JSON text/image assets, instructions), Gen (HTML generation model), Evaluation Module (section-wise assessment of model outputs), Evaluator (multimodal LLM for scoring and feedback), and Feedback (iterative refinement for continuous improvement), providing a unified pipeline from real-world data acquisition to structured multimodal assessment.
The framework enhances data quality and evaluation granularity through an agentic crawling framework, structured section-wise data representation, and a section-level multimodal evaluation protocol.
WebGen-V enables high-granularity assessment by aligning text, layout, and visuals at the section level, facilitating precise detection and correction of subtle design inconsistencies in LLM-generated webpages.

Exemplar-Guided Planning: Enhanced LLM Agent for KGQA

PoG-EGP (Plan-on-Graph with Exemplar-Guided Planning): introduces a novel framework that enhances LLM agents' planning capabilities for Knowledge Graph Question Answering (KGQA) by leveraging preprocessed training data, including Question Preprocessing, Text Embedding Generation, Exemplary Question Retrieval, Retrieved Exemplars, Smart Lookahead Mechanism, PoG, LLM Agent, Task Decomposition, Path Exploration, Memory, Evaluation, and Reflection, to dynamically guide the LLM's planning process in task decomposition and relation exploration.
The framework preprocesses training questions via entity templating, generates semantic embeddings, and retrieves similar exemplary questions and their reasoning paths using a FAISS index to provide high-quality auxiliary information.
A Smart Lookahead mechanism is integrated to improve efficiency during relation exploration by preemptively identifying promising paths and terminating exploration earlier, significantly enhancing performance and efficiency on KGQA datasets.

AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

AUGUSTUS (An LLM-Driven Multimodal Agent System with Contextualized User Memory): introduces a multimodal agent system that processes, stores, retrieves, and acts on user context across various modalities, aligning its four-stage loop (Encode, Store in Memory, Retrieve, Act) with human cognitive memory principles.
The system leverages an LLM as its central planner, integrating In-Context, Recall, and a novel graph-structured Contextual Memory to manage information, and employs a Contextual-Personalized (CoPe) search for efficient concept-driven retrieval.
AUGUSTUS utilizes modality-specific encoders for input understanding and various generation tools for multimodal output, demonstrating superior performance and efficiency compared to traditional multimodal RAG approaches.

EXPERIENCE-DRIVEN EXPLORATION FOR EFFICIENT API-FREE AI AGENTS

KG-Agent: introduces an experience-driven learning framework that structures raw pixel-level GUI interactions into a persistent State-Action Knowledge Graph (SA-KG) and employs a VLM-based Reasoning Module for skill invocation, augmentation, refinement, and evaluation.
The framework leverages a hybrid intrinsic reward mechanism, combining state value and novelty rewards, to support long-horizon reasoning and efficient exploration.
By connecting functionally similar yet visually distinct GUI states, KG-Agent enables generalization from diverse historical strategies, significantly improving exploration efficiency and strategic depth in API-free environments.

Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Multimodal RAG: introduces a systematic survey of Multimodal Retrieval-Augmented Generation for document understanding, detailing its components like User Query, Document, PDF2Img, OCR or Annotate, Image Retrieval, Text Retrieval, Multimodal Retrieval, Model, Answer Generation, Knowledge Base, Graph-based Index, Graph Traversal for Retrieval, LLM Agent, Query Decomposition, and Verification.
The survey categorizes existing methods by domain openness (closed/open), retrieval modality (image/text/hybrid), retrieval granularity (page/element), and hybrid enhancements (graph/agent-based).
It highlights the importance of Multimodal RAG for comprehensive document intelligence, addressing MLLM limitations in context modeling and enabling holistic retrieval and reasoning across text, tables, charts, and layout.

LLM-based In-situ Thought Exchanges for Critical Paper Reading

LLM-based In-situ Thought Exchange Interface: introduces a system designed to enhance junior researchers' critical paper reading skills by integrating AI-driven conversational agents into a custom PDF viewer, featuring a Comment Pane and Section Pane for interactive thought exchanges, highlighting, and commenting.
The system leverages LLMs to generate critical thinking questions, provide multi-disciplinary feedback, and reinterpret content, supporting both single-agent and multi-agent interaction modes.
This approach aims to foster critical thinking by encouraging active engagement and diverse perspectives, moving beyond passive information consumption.

EVOLVER: SELF-EVOLVING LLM AGENTS THROUGH AN EXPERIENCE-DRIVEN LIFECYCLE

EvolveR (Self-Evolving LLM Agents Through an Experience-Driven Lifecycle): introduces a self-evolving LLM agent framework through a closed-loop experience lifecycle, integrating online interaction, offline self-distillation, and policy evolution for continuous self-improvement.
The framework enables agents to transform raw interaction trajectories into a curated repository of strategic principles, which are then used to guide future decision-making and generate high-quality data.
EvolveR employs a dynamic experience curation system with mechanisms for self-distillation, semantic deduplication, integration, and quality control, ensuring a compact and effective knowledge base.

Agentic AI for Ultra-Modern Networks: Multi-Agent Framework for RAN Autonomy and Assurance

Multi-Agent Framework for RAN Autonomy and Assurance: introduces a distributed multi-agent architecture for RAN autonomy and assurance, featuring an Orchestrator Agent (manages workflow, resolves conflicts), Data Collector Agent (collects, validates raw telemetry), Preprocessor and Feature Agent (cleans, engineers data features), Model Trainer Agent (trains, optimizes AI/ML models), Model Validator Agent (evaluates, approves AI/ML models), Predictor Agent (forecasts KPIs, simulates scenarios), Policy Generator Agent (formulates network policies), Simulator/Baseline Agent (generates reference KPI trajectories), Verifier Agent (compares policies, enforces safety), Drift Detector Agent (detects model drift, triggers retraining), Deployment Agent (deploys verified network policies), Audit and Explainability Agent (generates audit reports, explanations), Security Agent (secures inter-agent communications), and Inter-Agent Communication (enables agent coordination).
This framework replaces centralized RIC-based control with specialized, collaborative agents to ensure autonomy, resilience, explainability, and system-wide safety in Beyond 5G/6G networks.
The architecture prevents unsafe policy deployments by incorporating independent verification and assurance stages, safeguarding global network health against model drift and unforeseen conditions.

16th October 2025

AGENTIC DESIGN OF COMPOSITIONAL MACHINES

Agentic Design of Compositional Machines: introduces a framework for LLM agents to design complex machines in the BesiegeField (simulated physical environment), including Designer (produces initial plan), Refiner (evaluates, proposes revisions), Inspector (abstractly assesses machine), Environment Querier (runs simulation, summarizes feedback), Meta-Designer (analyzes requirements, creates blueprint), Builder Agents (constructs blocks based on blueprint), and MCTS (search strategy for candidates).
The framework enables LLMs to construct machines from standardized components to meet functional demands, leveraging agentic workflows for iterative design and hierarchical construction.
The paper also explores RL finetuning of LLMs within this environment to improve spatial reasoning, strategic assembly, and instruction-following capabilities for machine design.

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

UI-Simulator: introduces a scalable paradigm for synthesizing training trajectories, integrating an LLM Pre-Training Corpus (Input data for LLMs), an LLM World Simulator (LLM-based UI environment generator), a Guided Rollout Process (Collects coherent, diverse UI trajectories), and a Trajectory Wrapper (Transforms rollouts into training data).
The framework leverages LLMs pre-trained on UI code and procedural knowledge to simulate diverse UI states and transitions, enabling robust digital agent training without extensive human annotation.
UI-Simulator-Grow extends this by incorporating Target Task Selection (Identifies high-impact learning tasks), Trajectory Variant Synthesis (Generates diverse task variations), and Continual Learning (Adapts agent policies iteratively) for data-efficient scaling.

INFORMATION GAIN-BASED POLICY OPTIMIZATION: A SIMPLE AND EFFECTIVE APPROACH FOR MULTI-TURN LLM AGENTS

IGPO (Information Gain-based Policy Optimization): introduces a reinforcement learning framework for multi-turn LLM agents, utilizing a Policy LLM (Agent) interacting with an Environment through a Rollout of sequential Turns, each comprising a Think Step, Tool Call Step, and Tool Response Step, culminating in an Answer Turn, where rewards are calculated using Ground Truth, combining an Information Gain Reward and an Outcome Reward into a Reward Trajectory, which is then used to compute a Discounted Cumulative Advantage for policy optimization via a GRPO-style Surrogate Objective, guided by a Prompt Template.
The framework addresses reward sparsity in multi-turn LLM agent training by providing dense, intrinsic, turn-level supervision based on information gain, which measures the marginal increase in the policy's probability of producing the correct answer.
IGPO integrates this intrinsic turn-level reward with outcome-level supervision to form a dense reward trajectory, enhancing credit assignment and improving sample efficiency and accuracy in multi-turn scenarios.

Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores

Clipped-Linear Model (Identity-Link Item Response Theory): introduces a novel LLM evaluation framework that leverages TVD-MI scores, an LLM judge, and an identity link to preserve additivity in agent-item score matrices, enabling sample-efficient sparse recovery.
This framework employs a clipped-linear model derived from Gini entropy maximization, which directly models raw TVD-MI scores as an additive decomposition of latent agent abilities and item difficulties, avoiding distortions from traditional logistic/probit links.
The approach achieves significant sample efficiency, requiring 3x fewer evaluations than dense methods while maintaining high reconstruction accuracy and preserving agent rankings, validated through discrete integrability tests and cross-domain experiments.

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates

TTRL Agent (Test-Time Reinforcement Learning Agent): introduces a reinforcement learning agent that self-improves schema mapping accuracy without labeled data or model updates by iteratively refining mappings through a Generative LLM, Conflict Detection, external Evidence Collection, and Confidence Evaluation, guided by dynamic prompts and an accumulating Memory/Context.
The agent identifies ambiguous mappings, formulates targeted web-search queries for external evidence, and uses confidence-based rewards to iteratively refine its mappings, reducing low-confidence mappings requiring expert review.
This approach provides an evidence-driven, transparent method for schema mapping, achieving high accuracy and reducing manual verification costs in scenarios with incomplete documentation.

THE GATEKEEPER KNOWS ENOUGH

The Gatekeeper Protocol: introduces a novel, domain-agnostic framework that governs LLM agent-system interactions, utilizing a System State-Context Representation (SCR) as a central data structure, an AGENT for reasoning and proposing actions, and a System / Execution Environment for validating and executing these actions.
This protocol mandates that the AGENT first reasons on a low-fidelity "latent state" representation within the SCR to strategically request high-fidelity context on demand, ensuring token efficiency and grounded interactions.
All interactions are mediated through a unified JSON format, serving as a declarative, state-synchronized protocol that ensures the agent's model of the system remains verifiably grounded in reality, significantly improving reliability and scalability.

Where to Search: Measure the Prior-Structured Search Space of LLM Agents

Formal Theory for LLM-assisted Iterative Search: introduces a compact formal theory to describe and measure LLM-assisted iterative search, representing agents as fuzzy relation operators and characterizing search space geometry.
The theory quantifies reachability difficulty using a coverage generating function and critical parameters, while safety is ensured by confining agents within a crisp idealized safety envelope.
A majority-vote instantiation on a 2D grid validates the abstract concepts, providing operational tools to measure LLM agents and their search spaces.

Agentic NL2SQL to Reduce Computational Costs

Datalake Agent: introduces an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently, with Information Acquisition, Iterative Refinement, and Query Formulation components, where the system reduces meta-information processing by selectively requesting necessary data.
The framework employs an interactive loop, allowing the LLM to gather general schema knowledge, refine its understanding hierarchically, and generate precise SQL queries using predefined commands like GetDBDescription, GetTables, GetColumns, and DBQueryFinalSQL.
This approach significantly reduces token usage and computational costs by up to 87% compared to direct prompting, while maintaining competitive performance on table question answering tasks across varying database sizes.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

ToolPRM (Fine-Grained Inference Scaling of Structured Outputs for Function Calling): introduces an inference scaling framework that combines a ToolPRM (process reward model) with fine-grained beam search, leveraging a fine-grained intra-call process supervision dataset and function masking techniques to enhance LLM agent performance in structured function calling.
The framework decomposes function calls into semantically interpretable intermediate reasoning steps, enabling ToolPRM to provide step-level rewards for each decision, which guides the beam search to "explore more but retain less" for reliable structured output generation.
This approach significantly improves backbone model performance across various function calling tasks by offering more granular feedback than coarse-grained or outcome-based reward models, addressing the unrecoverability of early errors in structured outputs.

LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

LLM Agents for Automated Web Vulnerability Reproduction: introduces a comprehensive evaluation framework for assessing LLM agents' capabilities in transforming vulnerability reports into working exploits, including a benchmark dataset, LLM agents, evaluation tasks, and criteria.
The evaluation systematically assesses 20 state-of-the-art LLM agents across 16 dimensions on 3 representative CVEs, then conducts an in-depth analysis of the top 3 agents (OpenHands, SWE-agent, CAI) on 80 real-world CVEs.
Findings reveal that while LLM agents achieve reasonable success on simple library-based vulnerabilities, they consistently fail on complex service-based vulnerabilities requiring multi-component environments and robust authentication.

LLM Agents Beyond Utility: An Open-Ended Perspective

Open-Ended LLM Agent Loop: introduces an LLM agent augmented with task generation, memory management, and environmental interaction capabilities, enabling it to autonomously generate and pursue its own goals in an open-ended setting.
The agent extends the ReAct framework by incorporating self-generated tasks, persistent long-term memory, and file tools for creating lasting environmental artifacts across multiple runs.
This system explores the potential and limitations of adapting pretrained LLMs for open-ended behavior, highlighting challenges in memory management, productive exploration, and abstract goal pursuit.

JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol

JSPLIT (Taxonomy-based Solution for Prompt Bloating in Model Context Protocol): introduces a taxonomy-driven framework to manage prompt size effectively for AI agents using large sets of Model Context Protocol (MCP) tools, by organizing tools into a hierarchical taxonomy and using LLMs to identify and include only relevant tools based on user queries and taxonomy structure.
This approach significantly reduces prompt size, token costs, and latency while improving tool selection accuracy and task success in complex agent environments.
The framework's core, the Taxonomy-MCPResolver, leverages LLMs for a two-phase process of taxonomy classification and MCP server ranking to prune irrelevant tools from the agent's context.

E2EDEV: BENCHMARKING LARGE LANGUAGE MODELS IN END-TO-END SOFTWARE DEVELOPMENT TASK

E2EDev (End-to-End Software Development Benchmark): introduces a novel benchmark grounded in Behavior-Driven Development (BDD) principles, evaluating LLM-based End-to-End Software Development (E2ESD) frameworks by assessing generated software against user needs through mimicking real user interactions, comprising fine-grained user requirements, multiple BDD test scenarios with Python step implementations, an automated testing pipeline, and a Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA).
The HITL-MAA framework leverages specialized LLM agents, including Code Analyzer, Requirement Extractor, Test Case Generator, Test Automation Engineer, Step Checker, and Test Runner agents, with human supervision at key stages to ensure data quality and reduce annotation effort.
E2EDev addresses limitations of existing E2ESD benchmarks by providing fine-grained requirements and reliable, automated evaluation protocols built on the Behave framework, revealing that current LLM-based frameworks struggle with detailed functional specifics and multi-agent architectures often incur high costs with minimal gains.

LIRA: LINGUISTIC ROBUST ANCHORING FOR CROSS-LINGUAL LARGE LANGUAGE MODELS

LiRA (Linguistic Robust Anchoring for Large Language Models): introduces a training framework that robustly improves cross-lingual representations under low-resource conditions by jointly strengthening retrieval and reasoning.
The framework integrates Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, and LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization.
Arca's Translation Critic judges candidate translations, the Embedding Critic anchors feature paths, and the Actor Model fuses these critics to select candidates, while LaSR's LLM Transformer fuses English and multilingual embeddings, supported by CorrQueue and DocQueue for training stability.

Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

NLT (Natural Language Tools): introduces a modular three-step architecture that replaces programmatic JSON tool calling with natural language outputs, decoupling tool selection from response generation to improve accuracy and reduce variance.
The framework utilizes a Selector LLM to identify relevant tools based on a natural language prompt, a Tool Parser to extract decisions, and a Tool Logic component to execute selected tools, before an Output Model generates the final response.
NLT significantly improves tool calling accuracy by 18.4 percentage points and reduces output variance by 70% across diverse models and domains, demonstrating enhanced robustness to prompt perturbations and extending capabilities to models lacking native support.

IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

IMAGINE (Integrating Multi-Agent System into One Model): introduces a framework that distills the reasoning and planning capabilities of a Multi-Agent System into a single, compact LLM model through a three-stage training pipeline, including New Query Generation, Multi-Agent System based Inference Data Generation, and Agentic Reasoning Training.
The framework's Multi-Agent System based Inference Data Generation stage employs a Reasoner, two Judges, and a Reflector to produce high-quality, reflected reasoning data for training.
Agentic Reasoning Training, comprising Agentic SFT and Agentic RL guided by a Newly Designed Agentic Reward Function, integrates and enhances the model's agentic reasoning abilities, enabling a small model to outperform larger Multi-Agent Systems.

The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems

CPR simulation framework: introduces a common-pool resource simulation framework for LLM multi-agent systems, with LLM agents, a shared resource, Harvest & Consumption, Individual Punishment, Social Learning, Group Decision modules, individual and group norms, cultural-evolutionary mechanisms, environmental feedback, payoff-biased social learning, a propose-vote rule, and prompts, enabling the endogenous emergence of cooperative norms without explicit reward signals.
The framework serves as a testbed to study how LLM agents develop strategies in mixed-motive settings and form group-beneficial norms through social learning and norm-based punishment.
The study validates the framework by reproducing human behavior findings and demonstrates its ability to discriminate LLMs based on their cooperative tendencies and norm formation capabilities under diverse environmental and social conditions.

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

MedTrust-Guided Iterative RAG: introduces a framework for biomedical question answering that enhances factual consistency and mitigates hallucinations by employing an iterative retrieval-verification pipeline and a MedTrust-Align Module for trust alignment.
The iterative pipeline, featuring a verifier agent and a generator agent, refines evidence and generates citation-grounded reasoning or refusal statements, while the MedTrust-Align Module constructs a hallucination-aware dataset and uses Direct Preference Optimization to reinforce reliable reasoning.
This approach systematically addresses hallucination patterns and evidence insufficiency in complex medical queries, leading to more accurate and trustworthy LLM responses in clinical contexts.

Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

YNTP (Your Next Token Prediction): introduces a multilingual benchmark for personalized response generation, utilizing an LLM-driven multi-NPC dialogue system that includes an FSM Engine (governs dialogue flow/state transitions), a Scenario Script (defines dialogue content/branching logic/NPC roles), and LLM Dialogue Generation (linguistic/emotional realization module).
This system collects natural, personalized, and psychologically grounded conversation data from users interacting with MBTI-dimensioned NPCs over five-day dialogue sessions.
The benchmark enables token-level prediction of individualized responses, moving beyond stylistic mimicry to model deeper cognitive regularities in word choice.

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Beyond One World: introduces a benchmark for evaluating LLMs' character-grounded role-playing across multiversal contexts, featuring Canon Events and Moral Dilemmas tasks, an LLM-as-a-judge rubric for thinking/acting, and a Think-Act Matching metric.
The benchmark assesses LLMs' ability to consistently portray version-specific superhero characters by probing factual recall and ethical decision-making across 30 iconic heroes and 90 canon-specific versions from Marvel and DC universes.
The evaluation framework disentangles internal deliberation from outward decisions, using structured prompting and an LLM judge, revealing critical gaps in multiversal consistency and reasoning alignment in current LLMs.

Stop-RAG: Value-Based Retrieval Control for Iterative RAG

Stop-RAG: introduces a value-based controller for adaptive stopping in iterative retrieval-augmented generation (RAG) systems, with an Iterative RAG Pipeline, Query Generator, Retriever, Reranker, Answer Generator, Stop-RAG Controller, MDP Formulation, Q-network, Q(λ) Targets, and Decision Rule, where it frames iterative RAG as a finite-horizon Markov Decision Process and trains a Q-network using Q(λ) targets to provide forward-looking estimates of stopping quality.
The framework adaptively decides when to stop retrieving by estimating and comparing immediate and future gains, enabling more reliable stopping decisions without relying on internal telemetry or fixed iteration counts.
Stop-RAG consistently improves performance on multi-hop question-answering benchmarks, demonstrating its effectiveness as a modular, plug-and-play component compatible with black-box LLMs and existing RAG pipelines.

Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies

TERRARIUM: introduces a modular and configurable framework for studying multi-agent systems (MAS) safety, privacy, and security, comprising Agent (LLM-based entity), Environment (simulator, state, objective), Blackboard (communication proxy), Tools (external capabilities), Communication Protocol (interaction rules), Factor Graph (blackboard initialization), MCP Server (model context protocol), Persistence (logs, configurations), and Infrastructure (LLMs, MCP servers).
The framework repurposes the blackboard design to create a modular, configurable testbed for multi-agent collaboration, enabling systematic study of attack vectors like misalignment, malicious agents, compromised communication, and data poisoning.
Its modular and configurable design facilitates rapid prototyping, evaluation, and iteration on defenses and designs, accelerating progress toward trustworthy multi-agent systems.

PRISM: AGENTIC RETRIEVAL WITH LLMS FOR MULTI-HOP QUESTION ANSWERING

PRISM (Precision-Recall Iterative Selection Mechanism): introduces an agentic retrieval framework that leverages LLM-based agents, including a Question Analyzer, Selector, and Adder, within an Iterative Refinement Loop to retrieve relevant evidence for multi-hop question answering.
The framework's Question Analyzer decomposes complex queries into sub-questions, while the Selector and Adder agents iteratively refine the evidence set by balancing precision and recall.
This approach produces compact and comprehensive evidence sets, which are then used by an Answer Generator Agent to provide accurate answers, outperforming strong baselines in multi-hop QA benchmarks.

GENLARP: Enabling Immersive Live Action Role-Play through LLM-Generated Worlds and Characters

GENLARP: introduces a virtual reality system that transforms personalized stories into immersive LARP experiences, utilizing Narrative Initialization (user input processing/world and story generation), Interactive Role Design (character and interaction logic), and Live-Action Role Play (immersive user experience) modules.
The system leverages generative AI and LLMs to create dynamic virtual worlds and characters, allowing users to act as both creators and players within the narrative.
It addresses traditional LARP limitations by enabling virtual reenactments without extensive physical setup or large groups, fostering deeper engagement through LLM-driven agents and dynamic narrative adaptation.

AlphaQuanter: An End-to-End Tool-Orchestrated Agentic Reinforcement Learning Framework for Stock Trading

AlphaQuanter: introduces a single-agent framework that leverages reinforcement learning (RL) to learn a dynamic policy over a transparent, tool-augmented decision workflow, empowering an agent to autonomously orchestrate tools and proactively acquire information on demand, establishing a transparent and auditable reasoning process.
The framework unifies workflows into a ReAct-like agent, starting with a guided plan, followed by iterative tool use and information seeking, and in-depth analysis, utilizing various financial data sources and a reward function for end-to-end optimization.
AlphaQuanter's design ensures decision consistency and interpretability by enforcing stepwise hypothesis testing and tightly coupling evidence collection with reasoning, leading to state-of-the-art performance on key financial metrics and sophisticated trading strategies.

TOWARDS AGENTIC SELF-LEARNING LLMS IN SEARCH ENVIRONMENT

ASL (Agentic Self-Learning): introduces a multi-role, closed-loop reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone, including a Prompt Generator (generates tasks, adapts difficulty), a Policy Model (generates solutions, improves performance), a Generative Reward Model (assesses correctness, refines evaluation), Tools (retrieves information), and a Meta Prompt (guides task generation).
ASL enables LLMs to autonomously evolve their reasoning, generation, and evaluation capabilities in a continuous closed loop, addressing the need for scalable reward signals and agent task data.
The framework demonstrates superior sample efficiency and robustness, achieving steady performance gains and surpassing strong RLVR baselines even under zero-labeled-data conditions.

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

OHAB (Online Harassment Agentic Benchmark): introduces a framework for systematically studying how multi-turn LLM agents can be coerced into generating abusive content, with a synthetic multi-turn harassment conversation dataset generation pipeline, a multi-agent simulation design, and a mixed-methods evaluation framework.
The framework employs various jailbreak methods, including persona-only priming, toxic memory injection, planning attacks (CoT/ReAct), and jailbreak fine-tuning, to assess vulnerabilities in LLMs like LLaMA-3.1-8B-Instruct and Gemini-2.0-flash-001.
The evaluation combines LLM-based judgment with human annotation, informed by social theories like Dark Triad Traits and Conflict Avoidance, to provide nuanced insights into harassment dynamics and behavioral patterns.

DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

DPRF (Dynamic Persona Refinement Framework): introduces a novel methodology to optimize LLM Role-Playing Agents' behavioral alignment with human ground truth by iteratively identifying cognitive divergences and refining persona profiles.
The framework operates through an iterative feedback loop, comparing agent-generated behaviors against human ground truth using a Behavior Analysis Agent and updating the persona via a Persona Refinement Agent.
DPRF is model-agnostic, domain-agnostic, and data-efficient, enhancing persona fidelity for applications like user simulation and personalized AI.

Agentic Entropy-Balanced Policy Optimization

AEPO (Agentic Entropy-Balanced Policy Optimization): introduces a dynamic entropy-balanced rollout (manages rollout sampling) and entropy-balanced policy optimization (optimizes policy updates), which together balance entropy during rollout and policy updates to enhance multi-turn tool-use capabilities in LLMs.
The dynamic entropy-balanced rollout adaptively allocates sampling budgets via entropy pre-monitoring and penalizes consecutive high-entropy branches to mitigate over-branching issues.
The policy optimization component preserves high-entropy token gradients and prioritizes learning on high-uncertainty tokens through entropy-aware advantage estimation, improving stability and scalability for web agent training.

HELMSMAN: AUTONOMOUS SYNTHESIS OF FEDERATED LEARNING SYSTEMS VIA MULTI-AGENT COLLABORATION

Helmsman: introduces a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications, including User, Planning Agent, Reflection Agent, Human Approval, Supervisor Agent, Coder Agent, Tester Agent, Evaluator Agent, Debugger Agent, Task Module, Client Module, Strategy Module, Server Module, Sandboxed Federated Simulation, Web Search Tool, RAG Pipeline, and AgentFL-Bench, by emulating a principled research and development workflow through interactive planning, modular code generation, and autonomous evaluation.
The framework structures the complex Federated Learning (FL) design process into three collaborative phases: interactive human-in-the-loop planning, modular code generation by supervised agent teams, and closed-loop autonomous evaluation and refinement in a sandboxed simulation environment.
Helmsman also introduces AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to rigorously assess the system-level generation capabilities of agentic systems in FL, demonstrating competitive and often superior solutions compared to hand-crafted baselines.

Why Instant-Runoff Voting Is So Resilient to Coalitional Manipulation: Phase Transitions in the Perturbed Culture

Phase Transition Analysis of Voting Rules in Perturbed Culture Model: introduces an analysis of Plurality, Two-Round System, and Instant-Runoff Voting within the Perturbed Culture Model, revealing phase transitions in their susceptibility to coalitional manipulation.
The study identifies a critical threshold (θc) for each rule, below which the CM rate tends to 1 for large electorates and above which it tends to 0.
The paper introduces the Super Condorcet Winner (SCW) concept, demonstrating its role as a key factor in IRV's exceptional resilience to CM, with IRV's θc being 0.

HI-AGENT: HIERARCHICAL VISION-LANGUAGE AGENTS FOR MOBILE DEVICE CONTROL

Hi-Agent (Hierarchical Vision-Language Agents for Mobile Device Control): introduces a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized.
The framework reformulates multi-step decision-making as a sequence of single-step subgoals and employs a foresight advantage function, leveraging execution feedback to guide high-level optimization.
Hi-Agent achieves state-of-the-art performance on mobile control benchmarks by combining structured task decomposition with stable, critic-free joint training.

MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation

MAGPIE (Multi-AGent contextual Privacy Evaluation): introduces a novel benchmark for evaluating privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios, featuring a Dataset Construction Pipeline (generates and validates scenarios), a Simulation Environment (orchestrates multi-agent negotiations), and an Evaluator LLM (assesses privacy leakage and task outcomes).
The benchmark comprises 200 high-stakes, multi-turn tasks where private information is integral to task resolution, forcing LLM agents to balance effective collaboration with strategic information control.
Evaluations reveal that state-of-the-art LLM agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage and struggle with consensus, often resorting to undesirable behaviors like manipulation and power-seeking.

Procedural Game Level Design with Deep Reinforcement Learning

Co-adaptive Procedural Content Generation Framework: introduces a novel method for procedural game level design using DRL, featuring a Hummingbird Agent (solver), a Floating Island Agent (generator), a Unity Environment (3D simulation), Proximal Policy Optimization (PPO) (training algorithm), Unity ML-Agents Toolkit (platform), a Feedback Loop (interaction mechanism), and Auxiliary Inputs (observation enhancement), where the system integrates DRL agents for both environment generation and task-solving.
This framework employs two PPO-trained agents: a hummingbird agent that learns to collect flowers in a dynamic 3D Unity environment, and an island agent that generates diverse, context-aware flower placements based on environmental cues and performance feedback.
The dynamic feedback loop between the agents enables co-adaptive learning, where the island agent evolves to create effective level configurations, and the hummingbird agent concurrently learns to solve them with greater robustness and generalization.

Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization

Policy Transfer with IPO (Iterative Policy Optimization): introduces a theoretical analysis of policy transfer for continuous-time Linear Quadratic Regulators (LQRs) with entropy regularization, proposing a novel IPO algorithm that achieves global linear and local super-linear convergence.
The framework demonstrates that an optimal policy from a source LQR can serve as a near-optimal initialization for closely related target LQRs, preserving convergence rates.
The analysis also establishes the stability of a class of continuous-time score-based diffusion models by connecting them with LQRs.

HUGAGENT: EVALUATING LLMS IN SIMULATING HUMAN-LIKE INDIVIDUAL REASONING ON OPEN-ENDED TASKS

HugAgent (Human-Grounded AGENT Benchmark): introduces a dual-track benchmark for average-to-individual reasoning adaptation, including an interactive semi-structured chatbot, a structured questionnaire, a dynamic question generator, and a Causal Belief Network for representing individual belief systems.
The framework utilizes both a synthetic track for scalable stress tests and a human-grounded track for ecologically valid reasoning data, enabling reproducible evaluation of intra-agent fidelity.
It operationalizes reasoning adaptation into two measurable tasks: Belief-State Inference and Belief Dynamics Update, aiming to predict how specific individuals reason and update beliefs in novel scenarios.

INTERNALIZING WORLD MODELS VIA SELF-PLAY FINETUNING FOR AGENTIC RL

SPA (Self Play Agent): introduces a reinforcement learning framework that equips LLM agents with an internal world model, decomposed into State Estimation and Transition Modeling, learned via a Self-Play Supervised Finetuning stage, to improve performance in out-of-distribution environments.
The framework first cold-starts the policy by enabling the LLM agent to self-play and acquire world knowledge from the environment, then uses this learned world model to simulate future states prior to policy optimization through RL training.
This approach significantly boosts success rates in environments like Sokoban and FrozenLake by grounding LLM reasoning in environmental rules rather than memorized trajectories, leading to more robust generalization.

GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

GUIrilla: introduces a scalable framework for automated desktop UI exploration, systematically exploring macOS applications via native accessibility APIs and simulated user interactions, supported by three LLM-based agents for element ordering, input generation, and task postprocessing.
The framework generates hierarchical GUI graphs from discovered interface elements and crawler actions, addressing data collection challenges in GUI automation and producing the GUIrilla-TASK dataset.
GUIrilla leverages specialized interaction handlers to achieve comprehensive application coverage and constructs function-centric tasks, enabling LLM-based agents to significantly improve performance on downstream UI tasks with less data.

15th October 2025

GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

GAPS (Grounding-Adequacy-Perturbation-Safety): introduces a clinically grounded, automated benchmark for evaluating AI clinicians, featuring Grounding (reasoning depth), Adequacy (answer completeness), Perturbation (input robustness), and Safety (harm prevention) axes, operationalized by a pipeline that constructs guideline-centered evaluation items and rubrics.
The framework employs an automated pipeline for evidence neighborhood assembly, knowledge graph and hierarchical tree representations, item generation across G-levels and P-perturbations, and rubric synthesis by a DeepResearch agent using a ReAct-style loop.
Scoring is performed by an ensemble of LLM judges, revealing that current LLMs excel at factual recall but struggle with increased reasoning depth, answer completeness, and robustness to adversarial inputs, guiding future AI clinician development.

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

ReGuard (Recovery Guardrail): introduces a control-theoretic approach to generative AI guardrails that formalizes AI safety as a sequential decision problem, learning predictive guardrails to monitor and proactively correct risky LLM outputs in real-time.
This framework operates in the LLM's latent representation of the world, enabling model-agnostic guardrails that can be trained via safety-critical reinforcement learning to detect and recover from unsafe states.
It moves beyond traditional flag-and-block guardrails by providing a principled dynamic alternative that balances safety and task efficiency, demonstrated in autonomous driving, e-commerce, and AI assistant scenarios.

Training LLM Agents to Empower Humans

Empower: introduces a self-supervised method for fine-tuning LLM agents to better assist humans by maximizing their empowerment, which is their ability to effect desired changes in the environment, using offline text data and a logit threshold mechanism to identify predictable code for completion.
The framework trains an LLM agent to complete predictable text, allowing the human user to focus on important design decisions rather than boilerplate code, thereby increasing their control over future outcomes.
Empower demonstrates that LLM assistants can be aligned without explicit human feedback or verifiable rewards by reasoning about how their actions enable humans to complete tasks more quickly.

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Deflanderization for Game Dialogue: introduces a novel approach for LLM-based NPCs in game dialogue, combining lightweight prompting techniques and fine-tuned large models to balance character authenticity with task execution.
The approach employs a Deflanderization prompting method to prevent excessive role-play and improve task fidelity, alongside Retrieval Augmented Generation and Supervised Finetuning for robust dialogue grounding.
The framework addresses the challenge of maintaining consistent NPC personas and executing tasks in fantasy RPG environments, achieving high rankings in a dialogue challenge.

STEER-MOE: EFFICIENT AUDIO-LANGUAGE ALIGNMENT WITH A MIXTURE-OF-EXPERTS STEERING MODULE

SteerMoE (Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module): introduces a novel and modular framework for audio-language alignment, utilizing a lightweight steering module with a Mixture-of-Experts router to dynamically transform continuous audio representations for a frozen LLM decoder.
The framework freezes both the audio encoder and LLM decoder, training only the steering module to preserve LLM's reasoning capabilities and enable plug-and-play component interchangeability.
SteerMoE achieves strong performance on ASR and audio understanding tasks, demonstrating a parameter-efficient and modular approach to multimodal AI by operating entirely in the continuous embedding space.

In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers

In-Browser LLM-Guided Fuzzing Framework: introduces an in-browser, LLM-guided fuzzing framework for real-time prompt injection testing in agentic AI browsers, with Fuzzing Controller, LLM Integration Layer, Browser Automation Layer, and Data Collection and Analytics components, designed to automatically discover prompt injection vulnerabilities in real-time by generating and testing malicious webpage content within a live browser environment.
The framework leverages LLMs to generate diverse and evolving attack content, using a real-time feedback loop to refine attack strategies based on the AI agent's observed behavior and actions.
This approach enables high-fidelity testing with full DOM context and action monitoring, demonstrating that static pattern-matching defenses are insufficient against adaptive, LLM-guided prompt injection attacks.

Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Type-Induced Bayesian Persuasion (BP): introduces a framework for implementing Bayesian Persuasion in natural language dialogues without pre-commitment, leveraging a commitment-communication mechanism where the persuader explicitly narrates potential types to guide the persuadee's Bayesian belief update.
The framework integrates a Bayesian setup, a composite signal structure (mbasic, mtype, mdes, minf), and a type-induced information schema (Sender Types, Base Policies, Schema Induction) to facilitate the Receiver's inference and decision process, implemented through Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP.
Experimental results show that BP-guided LLMs consistently outperform non-BP baselines, with SFNL excelling in credibility and logical coherence, while FNL demonstrates stronger emotional resonance and robustness, and supervised fine-tuning enables smaller models to achieve comparable performance to larger models.

MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation

MADREC (Multi-Aspect Driven LLM Agent): introduces an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews, performs direct and sequential recommendation, generates explanations, and dynamically adjusts inference criteria via a SELF-FEEDBACK mechanism.
The framework leverages MEMORY to store user and item profiles, TOOLS for aspect extraction, summarization, and re-ranking, and TASKS for various recommendation objectives, all integrated within an active agent architecture.
MADREC enhances explainability and adaptivity by generating structured profiles, re-ranking candidate items based on multi-aspect relevance, and iteratively refining recommendations through self-feedback, outperforming traditional and LLM-based baselines.

Document Intelligence in the Era of Large Language Models: A Survey 2510.13366 http://arxiv.org/abs/2510.13366 DI-LLM (Document Intelligence in the Era of Large Language Models): Multimodal Document AI / Multilingual Document AI / Retrieval-Augmented Paradigm / DocAgent Framework / DocAgent Foundation Model Multimodal Document AI (integrating diverse modalities) / Multilingual Document AI (handling diverse languages) / Retrieval-Augmented Paradigm (leveraging external knowledge) / DocAgent Framework (agent-based document processing) / DocAgent Foundation Model (domain-aware, cross-modal models) DI-LLM (Document Intelligence in the Era of Large Language Models): introduces a comprehensive survey of Document AI advancements, categorizing tasks into understanding and generation, integrating multimodal and multilingual capabilities, and leveraging retrieval-augmented methods. The survey explores key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions including agent-based approaches and document-specific foundation models. It provides a structured analysis of the state-of-the-art in DAI, highlighting the evolution of LLM-based approaches and their implications for both academic and practical applications.

List the architectural components found in the figures. The paper is a survey and does not present architectural figures for a single proposed framework. Instead, Table 1 summarizes benchmark datasets, detailing their supported languages, document counts, modalities (Text, Visual, Layout), and tasks (Key Information Extraction (KIE), Document Layout Analysis (DLA), Document Sentiment Analysis (DSA), Document Classification (DC), Document Summarization (DS), Document Content Generation (DCG), Question Answering (QA)). These modalities and tasks represent functional components or capabilities within Document AI systems.