Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Generating Symbolic World Models via Test-time Scaling of Large Language Models
Authors: Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experiments and Results): We evaluate several methods on the International Planning Competition benchmark, which encompasses diverse complex planning domains and problems. Our evaluation focuses on two key PDDL domain synthesis tasks, including (1) NL2Domain, which aims to convert natural language descriptions to PDDL domains; and (2) Prob2Domain, which aims to derive necessary PDDL domains from PDDL problems. The evaluation metric used here is the success rate of the generated PDDL domain passing the PDDL validation system [20]. Table 2: A comparison of performance in PDDL domain synthesis between the baseline models (including both open-source and closed-source models) and our methods. BoN-8 refers to BoN sampling with 8 candidates, while iVML-5-BoN-8 denotes five iterations of iVML training initialized with BoN-8. In this experiment, the metric refers to the percentage of generated PDDL domains that pass VAL validation without errors. |
| Researcher Affiliation | Collaboration | Zhouliang Yu1,2,*, Yuhuan Yuan3,*, Tim Z. Xiao4, Fuxiang Frank Xia5, Jie Fu6, Ge Zhang7, Ge Lin3, Weiyang Liu4; 1The Chinese University of Hong Kong, 2Hong Kong University of Science and Technology, 3HKUST (Guangzhou), 4Max Planck Institute for Intelligent Systems, Tübingen, 5Environmental Systems Research Institute, 6Shanghai Artificial Intelligence Laboratory, 7SEED, ByteDance |
| Pseudocode | No | The paper describes a 'simple yet effective algorithm' but does not present it in a formal pseudocode block or algorithm figure. It describes the steps verbally rather than in a structured, code-like format. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a link to a repository for the methodology described. While it mentions 'Open-Source Models' in Table 2, this refers to external models used in their experiments, not their own implementation code. |
| Open Datasets | Yes | We evaluate several methods on the International Planning Competition benchmark, which encompasses diverse complex planning domains and problems. Our evaluation focuses on two key PDDL domain synthesis tasks... (footnote: https://github.com/potassco/pddl-instances) We adopt the Planetarium [61] benchmark, which evaluates LLMs' capacity to generate precise PDDL problems from natural language descriptions. |
| Dataset Splits | No | The paper mentions evaluating methods on the International Planning Competition benchmark and the Planetarium [61] benchmark. While Planetarium is noted to have 'fine-tuned Gemma and Mistral on a training dataset containing 132,027 examples', this refers to a setting for *other* baselines and not for the authors' own method or specific splits used for their experiments. The paper does not provide specific training/test/validation dataset splits for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It focuses on test-time computation and model performance rather than the underlying infrastructure. |
| Software Dependencies | No | The backbone LLMs in our experiment include Qwen2.5-Instruct (0.5B–72B parameters) [51], LLaMA3.1-Instruct (8B and 70B parameters) [15], and Yi-1.5-Chat (6B, 9B, and 34B parameters) [53]. We also incorporate specialized code-oriented LLMs, specifically Qwen2.5-Coder and Yi-1.5-Coder. In addition to open-source LLMs, we benchmark against OpenAI's proprietary models, including GPT-4o, o1-mini, and o1-preview. Our PDDL-based method uses the Fast Downward system [19] for heuristic search and plan validation. The paper mentions various LLM models and the Fast Downward system, but it does not specify version numbers for these or other ancillary software libraries/environments. |
| Experiment Setup | Yes | To generate diverse PDDL domain synthesis paths, we use temperature sampling for both the BoN and iVML algorithms, and the temperature equals 0.7. BoN-8 refers to BoN sampling with 8 candidates, while iVML-5-BoN-8 denotes five iterations of iVML training initialized with BoN-8. |
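The BoN-8 setting quoted in the table can be illustrated with a minimal Best-of-N sketch. This is not the authors' implementation: `generate` and `validates` below are hypothetical stand-ins for temperature-sampled LLM generation of a PDDL domain and for the VAL validation check, respectively; only the overall sample-then-filter structure reflects the paper's description.

```python
import random

def generate(prompt, temperature=0.7, seed=None):
    # Hypothetical stand-in for an LLM sampling a PDDL domain at temperature 0.7.
    rng = random.Random(seed)
    return f"(define (domain d{rng.randint(0, 999)}))"

def validates(domain):
    # Hypothetical stand-in for VAL validation; here just a trivial syntax check.
    return domain.startswith("(define (domain") and domain.endswith("))")

def best_of_n(prompt, n=8, temperature=0.7):
    """Sample n candidate domains and return the first one that passes validation."""
    candidates = [generate(prompt, temperature, seed=i) for i in range(n)]
    for cand in candidates:
        if validates(cand):
            return cand
    return None  # no candidate passed validation

result = best_of_n("Blocks can be stacked on a table or on other blocks.", n=8)
```

In the paper's setting, the reported metric is the fraction of generated domains that pass VAL; the sketch's boolean `validates` plays that role for a single candidate.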