An agent-based evaluation framework for complex code generation.
Below is a video demonstrating the main functionalities of the prototype. Click the thumbnail to watch a walkthrough of the key features and the overall workflow of the framework.
default.mp4
Requirement-guided multi-dimensional context distillation
- Collecting contextual information based on the stepwise evaluation plan.
Fine-grained scoring and summarization
- Generating evaluation scores and reports through negotiation between multiple judges.
- Provides structured evaluation reports (in Markdown and PDF format) containing evaluation scores, environment configuration, task requirements, and stepwise as well as overall evaluation results.
- Integrates a variety of external tools for code evaluation, including dynamic execution, static linting, unit tests, screenshot/interaction, web browsing, and more.
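To illustrate how such external tools can sit behind a common interface, here is a minimal, hypothetical sketch. The class names and the choice of `flake8` as the linter are illustrative assumptions, not CodeVisionary's actual API:

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Outcome reported by an external evaluation tool (illustrative)."""
    tool: str
    passed: bool
    details: str


class LintTool:
    """Hypothetical wrapper that runs a static linter (flake8) on a code file."""
    name = "static-linter"

    def run(self, code_path: str) -> ToolResult:
        # flake8 exits non-zero when it finds style or error reports.
        proc = subprocess.run(["flake8", code_path], capture_output=True, text=True)
        return ToolResult(
            tool=self.name,
            passed=proc.returncode == 0,
            details=proc.stdout.strip(),
        )


# An agent could then collect evidence from several such tool wrappers:
# results = [tool.run("response.py") for tool in (LintTool(), ...)]
```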
- Python 3.x
- Docker
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/Eshe0922/CodeVisionary.git
  ```

- Build the Docker image:

  ```bash
  cd docker
  docker build -t codevisionary.evaluate .
  docker pull eshe1836316339/codevisionary:lint
  docker tag eshe1836316339/codevisionary:lint codevisionary.lint
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  npm install --save-dev prettier
  apt-get install pandoc
  apt-get install texlive-xetex
  ```

You can execute the run.sh script with the following arguments:
```bash
SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
    --evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
    --write_path "${SCRIPT_DIR}/experiments/test" \
    --pdf
```

Where:

- `--evaluation_path`: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
- `--write_path`: Directory where the evaluation results and outputs will be saved.
- `--pdf`: (Optional) If specified, the evaluation results will also be exported as a PDF report.
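If you prefer to launch the evaluation from Python rather than through run.sh, a minimal sketch of the equivalent invocation (assuming it is executed from the repository root, mirroring run.sh) looks like this:

```python
import subprocess
from pathlib import Path

# Assumed to be executed from the repository root, mirroring run.sh.
repo_root = Path.cwd()

cmd = [
    "python3", "main.py",
    "--evaluation_path", str(repo_root / "dataset" / "benchmark_test.jsonl"),
    "--write_path", str(repo_root / "experiments" / "test"),
    "--pdf",  # optional: also export the report as a PDF
]

# Run main.py and raise if the evaluation exits with a non-zero status.
subprocess.run(cmd, cwd=repo_root, check=True)
```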
The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:
- `id`: (int) Unique identifier for the sample.
- `question`: (str) The coding or evaluation question.
- `response`: (str) The code or answer generated by the model.
- `model`: (str) The name or identifier of the model that generated the response.
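Before running an evaluation, it can help to sanity-check the dataset file. The snippet below is a minimal, illustrative validator; the helper name and checks are assumptions, not part of CodeVisionary:

```python
import json

# Required fields and their expected types, as described above.
REQUIRED_FIELDS = {"id": int, "question": str, "response": str, "model": str}


def validate_dataset(path: str) -> None:
    """Raise ValueError if any line of the JSONL dataset is malformed."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            sample = json.loads(line)
            for field, expected_type in REQUIRED_FIELDS.items():
                if field not in sample:
                    raise ValueError(f"line {line_no}: missing field '{field}'")
                if not isinstance(sample[field], expected_type):
                    raise ValueError(
                        f"line {line_no}: field '{field}' should be {expected_type.__name__}"
                    )


validate_dataset("dataset/benchmark_test.jsonl")
```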
Example:
{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n return max(lst)", "model": "gpt-4"}agents/- Agent implementations for code evaluationdataset/- Datasets used for code evaluationdocker/- Docker-related configurationsexperiments/- Experiment resultstools/- External tools designed for code evaluationutils/- Utility functions and helper classesmain.py- Main entry pointrun.sh- Shell script for executing themain.py
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bibtex
@misc{wang2025codevisionaryagentbasedframeworkevaluating,
      title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation},
      author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
      year={2025},
      eprint={2504.13472},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2504.13472},
}
```

License: MIT
