CodeVisionary

An agent-based evaluation framework for complex code generation (ASE'25).

Framework Demonstration

The video below demonstrates the main functionality of the prototype. Click the thumbnail to watch the key features and the overall workflow of the framework in action.

(Video: default.mp4)

Features

Two-Stage Framework

Requirement-guided multi-dimensional context distillation

  • Collects contextual information based on the stepwise evaluation plan.

Fine-grained scoring and summarization

  • Generates evaluation scores and reports through negotiation among multiple judges.

Detailed Evaluation Report

  • Provides structured evaluation reports (in Markdown and PDF formats) with evaluation scores, environment configuration, task requirements, stepwise evaluation results, and overall evaluation results; a rough sketch of this report structure follows the feature list.

Multi-Tool Integration

  • Integrates a range of external tools for code evaluation, including dynamic execution, static linting, unit testing, screenshot/interaction capture, and web browsing, among others.
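
As a rough illustration of the report contents listed above, the sketch below models one evaluation record as a Python dataclass. The class and field names (EvaluationReport, StepResult, stepwise_results, and so on) are hypothetical and chosen for illustration; they are not taken from the CodeVisionary codebase.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepResult:
    """One step of the stepwise evaluation plan (hypothetical structure)."""
    description: str        # what this step checks, e.g. "run unit tests"
    tool: str               # tool involved, e.g. "dynamic execution" or "static linting"
    score: float            # score assigned for this step
    observations: str = ""  # raw tool output or judge notes


@dataclass
class EvaluationReport:
    """Hypothetical shape of a structured evaluation report as described above."""
    sample_id: int
    environment: Dict[str, str]         # environment configuration (image, dependencies, ...)
    task_requirements: List[str]        # requirements distilled from the question
    stepwise_results: List[StepResult]  # per-step evaluation results
    overall_score: float                # final score after judge negotiation
    summary: str                        # overall evaluation results in prose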

Prerequisites

  • Python 3.x
  • Docker
  • Git

Installation

  1. Clone the repository:
git clone https://github.com/Eshe0922/CodeVisionary.git
  2. Build and pull the Docker images:
cd docker
docker build -t codevisionary.evaluate .
docker pull eshe1836316339/codevisionary:lint
docker tag eshe1836316339/codevisionary:lint codevisionary.lint 
  3. Install the required dependencies:
pip install -r requirements.txt
npm install --save-dev prettier
apt-get install pandoc
apt-get install texlive-xetex

Usage

You can run the evaluation via the run.sh script, which invokes main.py with the following arguments:

SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
  --evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
  --write_path "${SCRIPT_DIR}/experiments/test" \
  --pdf

Where:

  • --evaluation_path: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
  • --write_path: Directory where the evaluation results and outputs will be saved.
  • --pdf: (Optional) If specified, the evaluation results will also be exported as a PDF report.
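
For reference, a minimal argparse setup consistent with the flags documented above might look like the following. This is an illustrative reconstruction of the command-line interface under the assumption that main.py uses argparse; it is not the actual contents of main.py.

import argparse

# Minimal sketch of the documented CLI; the real main.py may differ.
parser = argparse.ArgumentParser(
    description="CodeVisionary evaluation entry point (sketch)")
parser.add_argument("--evaluation_path", required=True,
                    help="Path to the evaluation dataset in JSONL format")
parser.add_argument("--write_path", required=True,
                    help="Directory where evaluation results are saved")
parser.add_argument("--pdf", action="store_true",
                    help="Also export the evaluation report as a PDF")
args = parser.parse_args()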

The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:

  • id: (int) Unique identifier for the sample.
  • question: (str) The coding or evaluation question.
  • response: (str) The code or answer generated by the model.
  • model: (str) The name or identifier of the model that generated the response.

Example:

{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n    return max(lst)", "model": "gpt-4"}

Project Structure

  • agents/ - Agent implementations for code evaluation
  • dataset/ - Datasets used for code evaluation
  • docker/ - Docker-related configurations
  • experiments/ - Experiment results
  • tools/ - External tools designed for code evaluation
  • utils/ - Utility functions and helper classes
  • main.py - Main entry point
  • run.sh - Shell script for executing main.py

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Citation

@misc{wang2025codevisionaryagentbasedframeworkevaluating,
      title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation}, 
      author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
      year={2025},
      eprint={2504.13472},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2504.13472}, 
}

License

MIT

Acknowledgement

https://github.com/Aider-AI/aider
