Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generating Symbolic World Models via Test-time Scaling of Large Language Models

Authors: Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4, Experiments and Results: We evaluate several methods on the International Planning Competition benchmark, which encompasses diverse, complex planning domains and problems. Our evaluation focuses on two key PDDL domain synthesis tasks: (1) NL2Domain, which aims to convert natural language descriptions into PDDL domains; and (2) Prob2Domain, which aims to derive the necessary PDDL domains from PDDL problems. The evaluation metric used here is the success rate of the generated PDDL domain passing the PDDL validation system [20]. Table 2 caption: A comparison of performance in PDDL domain synthesis between the baseline models (including both open-source and closed-source models) and our methods. BoN-8 refers to BoN sampling with 8 candidates, while iVML-5-BoN-8 denotes five iterations of iVML training initialized with BoN-8. In this experiment, the metric refers to the percentage of generated PDDL domains that pass VAL validation without errors.
Researcher Affiliation | Collaboration | Zhouliang Yu1,2,*, Yuhuan Yuan3,*, Tim Z. Xiao4, Fuxiang Frank Xia5, Jie Fu6, Ge Zhang7, Ge Lin3, Weiyang Liu4. Affiliations: 1The Chinese University of Hong Kong; 2Hong Kong University of Science and Technology; 3HKUST (Guangzhou); 4Max Planck Institute for Intelligent Systems, Tübingen; 5Environmental Systems Research Institute; 6Shanghai Artificial Intelligence Laboratory; 7SEED, ByteDance
Pseudocode | No | The paper describes a 'simple yet effective algorithm' but does not present it in a formal pseudocode block or algorithm figure; the steps are described verbally rather than in a structured, code-like format.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a link to a repository for the methodology described. While it mentions 'Open-Source Models' in Table 2, this refers to external models used in the experiments, not the authors' own implementation code.
Open Datasets | Yes | We evaluate several methods on the International Planning Competition benchmark (footnote 1: https://github.com/potassco/pddl-instances), which encompasses diverse, complex planning domains and problems. Our evaluation focuses on two key PDDL domain synthesis tasks... We adopt the Planetarium [61] benchmark, which evaluates LLMs' capacity to generate precise PDDL problems from natural language descriptions.
Dataset Splits | No | The paper mentions evaluating methods on the International Planning Competition benchmark and the Planetarium [61] benchmark. While Planetarium is noted to have 'fine-tuned Gemma and Mistral on a training dataset containing 132,027 examples', this refers to a setting for *other* baselines, not the authors' own method. The paper does not provide specific training/validation/test splits for its experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It focuses on test-time computation and model performance rather than the underlying infrastructure.
Software Dependencies | No | The backbone LLMs in our experiment include Qwen2.5-Instruct (0.5B to 72B parameters) [51], LLaMA3.1-Instruct (8B and 70B parameters) [15], and Yi-1.5-Chat (6B, 9B, and 34B parameters) [53]. We also incorporate specialized code-oriented LLMs, specifically Qwen2.5-Coder and Yi-1.5-Coder. In addition to open-source LLMs, we benchmark against OpenAI's proprietary models, including GPT-4o, o1-mini, and o1-preview. Our PDDL-based method uses the Fast Downward system [19] for heuristic search and plan validation. The paper mentions various LLM models and the Fast Downward system, but it does not specify version numbers for these or other ancillary software libraries/environments.
Experiment Setup | Yes | To generate diverse PDDL domain synthesis paths, we use temperature sampling for both the BoN and iVML algorithms, with the temperature set to 0.7. BoN-8 refers to BoN sampling with 8 candidates, while iVML-5-BoN-8 denotes five iterations of iVML training initialized with BoN-8.
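For context, the BoN-8 setting quoted above (8 candidates drawn by temperature sampling, best one kept by validation) can be sketched as follows. This is a minimal illustration only: `fake_generate` and `fake_score` are hypothetical stand-ins for the paper's LLM sampler and VAL-based validation, not the authors' implementation.

```python
import random

def best_of_n(generate, score, n=8, temperature=0.7, seed=0):
    """Best-of-N (BoN) sampling: draw n candidates, return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(temperature, rng) for _ in range(n)]
    return max(candidates, key=score)

def fake_generate(temperature, rng):
    # Hypothetical stand-in for an LLM call at the given sampling temperature.
    quality = rng.gauss(0.5, temperature)  # higher temperature -> wider spread
    return (f"(define (domain d) ...)  ; quality={quality:.3f}", quality)

def fake_score(candidate):
    # Hypothetical stand-in for a validator score (the paper uses VAL pass/fail).
    _domain_text, quality = candidate
    return quality

best_domain, best_quality = best_of_n(fake_generate, fake_score, n=8)
```

With a fixed seed the selection is deterministic, and by construction the chosen candidate scores at least as well as any single sample, which is the entire rationale for spending test-time compute on BoN.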