Official implementation for the paper: Translating Natural Language to Planning Goals with Large-Language Models
This repository provides a comprehensive benchmark and framework for evaluating language models' ability to translate natural language instructions into formal PDDL (Planning Domain Definition Language) goal specifications across multiple domains.
The benchmark tests language models on three key capabilities:
- Goal Translation: Converting natural language instructions to PDDL goal states (see the example after this list)
- Domain Understanding: Comprehending planning domain semantics and constraints
- Goal Inference: Reasoning about implicit requirements and task completion criteria
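For instance, goal translation maps an instruction to a goal expression in the target domain's PDDL. An illustrative pair (hand-written; the predicate name follows the standard Blocksworld encoding, not necessarily this repository's templates):

```python
# Illustrative goal-translation pair (hand-written example).
instruction = "Stack the red block on top of the blue block."
expected_goal = "(:goal (and (on red blue)))"  # standard Blocksworld 'on' predicate
```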
The framework spans two distinct domains with varying complexity levels:
- Blocksworld: Classic planning domain with explicit and implicit stack configurations
- Alfred: Household robotics tasks with partial specifications and open-ended objectives
```
.
├── Blocksworld/                                # Blocksworld domain tasks
│   ├── domains/blocksworld/                    # PDDL templates and generation scripts
│   ├── scripts/                                # Evaluation utilities
│   └── README.md                               # Detailed Blocksworld documentation
│
├── Alfred/
│   ├── Alfred_Partially-Specified-Tasks_1to8/  # Tasks with incomplete specifications
│   │   ├── generate_initial_states.py          # Initial state generation
│   │   ├── generate_goal_pddl.py               # Goal state generation
│   │   ├── evaluate.py                         # Task-specific evaluation
│   │   └── README.md                           # Partially-specified tasks guide
│   │
│   └── Alfred_Open-Ended-Tasks_9to13/          # Complex household tasks
│       ├── openai_gen.py                       # NL to PDDL translation
│       ├── domain/                             # PDDL domain definitions
│       └── README.md                           # Open-ended tasks guide
│
└── README.md                                   # This file
```
Requirements:
- Python 3.7.16 or higher
- An OpenAI API key (for n-shot inference)
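The OpenAI-backed scripts presumably read the key from the environment; a typical setup, assuming the openai client's default OPENAI_API_KEY variable (confirm against the scripts):

```python
import os

# The openai Python client looks up OPENAI_API_KEY by default; set it in
# your shell (export OPENAI_API_KEY=...) or here before running the
# generation scripts. The key below is a placeholder.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")
```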
The Blocksworld domain includes seven task types testing different aspects of goal specification:
- ExplicitStacks: Direct stack configurations
- ExplicitStacks-II: Sequential descriptions (top-to-bottom, bottom-to-top)
- NBlocks: Constraints on block counts
- KStacks: Constraints on stack counts
- PrimeStack: Prime-number constraints on stack height (see the sketch after this list)
- BlockAmbiguity: Ambiguous block references
- KStacksColor: Color-based stack constraints
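The implicit tasks require numeric reasoning on top of translation. As one hedged illustration of what a PrimeStack constraint could mean (the exact task definition is in Blocksworld/README.md; the helper below is hypothetical):

```python
from typing import List

def is_prime(n: int) -> bool:
    """True iff n is a prime number."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def prime_stack_satisfied(stack_heights: List[int]) -> bool:
    # Under this reading, the goal holds when every stack's height is
    # prime: [2, 3, 5] satisfies it, [2, 4] does not.
    return all(is_prime(h) for h in stack_heights)
```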
See Blocksworld/README.md for detailed setup and evaluation instructions.
Seven task types focusing on spatial reasoning and quantitative constraints:
- MoveSynonym: Synonym understanding in movement commands
- MoveNextTo: Relative positioning
- MoveToCount2/3: Multi-object counting constraints (see the sketch after this list)
- MoveToMore: Comparative quantity reasoning
- MoveNested/Nested2: Hierarchical spatial relationships
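For the counting tasks, success can be checked directly against the final PDDL state. A minimal sketch, assuming ALFRED-style (inReceptacle ?object ?receptacle) facts (the predicate name and threshold are illustrative):

```python
from typing import Iterable, Tuple

def count_in_receptacle(facts: Iterable[Tuple[str, str, str]],
                        obj_type: str, receptacle: str) -> int:
    """Count (inReceptacle obj recep) facts whose object matches obj_type."""
    return sum(
        1
        for pred, obj, recep in facts
        if pred == "inReceptacle" and obj.startswith(obj_type) and recep == receptacle
    )

# MoveToCount2-style check: at least two apples in the fridge.
facts = [("inReceptacle", "apple_1", "fridge_1"),
         ("inReceptacle", "apple_2", "fridge_1")]
assert count_in_receptacle(facts, "apple", "fridge_1") >= 2
```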
See Alfred/Alfred_Partially-Specified-Tasks_1to8/README.md for generation and evaluation procedures.
Five realistic household scenarios:
- CutFruits: Slice and plate fruits
- PrepareMeal: Arrange meal components
- IceCream: Organize refrigerator contents
- SetTable2: Configure dining setup
- CleanKitchen: Organize kitchen items
See Alfred/Alfred_Open-Ended-Tasks_9to13/README.md for task descriptions and usage.
Generate PDDL problem templates and initial states for your chosen domain:

Blocksworld:

```bash
cd Blocksworld
bash domains/blocksworld/gen_pddl.sh
```

Alfred (Partially-Specified):

```bash
cd Alfred/Alfred_Partially-Specified-Tasks_1to8
python generate_initial_states.py --init_path init_states --scene scene9
```
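The generated templates are ordinary PDDL problem files. For orientation, a hand-written example of their general shape (not actual script output; predicate names vary across Blocksworld encodings):

```python
# Hand-written illustration of a generated Blocksworld problem file;
# the scripts' actual object names and predicates may differ.
EXAMPLE_PROBLEM = """
(define (problem stack-two)
  (:domain blocksworld)
  (:objects a b)
  (:init (on-table a) (on-table b) (clear a) (clear b) (arm-empty))
  (:goal (and (on a b))))
"""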
Translate natural language instructions to PDDL goals:

Blocksworld:

```bash
bash domains/blocksworld/fewshot_test.sh
```

Alfred (Partially-Specified):

```bash
python generate_goal_pddl.py --scene scene9 --init_path init_states --gen_path gen_results
```

Alfred (Open-Ended):

```bash
python openai_gen.py
```
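Under the hood these scripts prompt an OpenAI model with translation examples. A minimal sketch of such a call using the current openai Python client (the prompt, model name, and decoding settings are placeholders, not the repository's actual ones):

```python
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()

PROMPT = """Translate the instruction into a PDDL goal.
Instruction: Stack block a on block b.
Goal: (:goal (and (on a b)))
Instruction: {instruction}
Goal:"""

def translate(instruction: str) -> str:
    # One-shot prompt with a single worked example; greedy decoding so
    # repeated runs return the same goal string.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(instruction=instruction)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```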
Evaluate generated goal states and model capabilities:

Blocksworld:

```bash
python scripts/evaluation.py --mode all
```

Alfred (Partially-Specified):

```bash
python evaluate.py --scene scene9 --init_path init_states --gen_path gen_results
```

Alfred (Open-Ended): evaluated by human judges.
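A common way to score a generated goal automatically is to compare its atomic predicates against the reference goal. A naive sketch of such a check (evaluate.py may use a more permissive equivalence, e.g. planner-based validation):

```python
import re

def goal_atoms(goal_str: str) -> frozenset:
    # Innermost parenthesised expressions are the atomic predicates,
    # e.g. '(on a b)'; '(and ...)' wrappers never match because they
    # contain nested parentheses.
    return frozenset(re.findall(r"\([^()]+\)", goal_str))

def goals_match(generated: str, reference: str) -> bool:
    return goal_atoms(generated) == goal_atoms(reference)

assert goals_match("(:goal (and (on a b) (clear a)))",
                   "(:goal (and (clear a) (on a b)))")
```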
The framework evaluates three dimensions:
- Goal Translation Success: Correctness of generated PDDL goal specifications
- Domain Understanding: Accuracy on domain-specific comprehension questions
- Goal Inference: Ability to infer implicit task requirements
Results include both success rates and detailed error analysis for each capability.
Toggle between zero-shot and one-shot learning by adding the --zero_shot flag:

```bash
python generate_goal_pddl.py --scene scene9 --init_path init_states --gen_path gen_results --zero_shot
```

Evaluate only a subset of test cases:

```bash
python evaluate.py --scene scene9 --init_path init_states --gen_path gen_results --num_case 50
```

If you use this benchmark in your research, please cite:
```bibtex
@article{xie2023translating,
  title={Translating natural language to planning goals with large-language models},
  author={Xie, Yaqi and Yu, Chen and Zhu, Tongyao and Bai, Jinbin and Gong, Ze and Soh, Harold},
  journal={arXiv preprint arXiv:2302.05128},
  year={2023}
}
```