Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling
Authors: Jinghan Li, Zhicheng Sun, Yadong Mu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method is evaluated on the Virtual Home-Env benchmark, showing advanced performance with improved scaling w.r.t. inference-time computation. Code is available at https://github.com/Singularity0104/equilibrium-planner. [...] 4. Experiments [...] Table 1: Performance on Virtual Home-Env without correction. Our planner achieves state-of-the-art performance in most evaluations. |
| Researcher Affiliation | Academia | 1Peking University, China. Correspondence to: Yadong Mu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Inference of Equilibrium Planner |
| Open Source Code | Yes | Code is available at https://github.com/Singularity0104/equilibrium-planner. |
| Open Datasets | Yes | Our method is evaluated on the Virtual Home-Env benchmark (Puig et al., 2018; Liao et al., 2019), demonstrating its advantageous performance with better scaling w.r.t. inference computation than tree-based alternatives. |
| Dataset Splits | Yes | We randomly divide the Virtual Home-Env dataset into a training set and a test set in a 50:50 ratio. To analyze the generalizability of our method, we mainly study the following three subsets of the test set: novel scene set, novel task set, and novel scene and task set. Overall, the dataset contains 735 training trajectories, 468 trajectories within the novel task set, 95 trajectories within the novel scene set, and 62 trajectories within the novel scene and task set. |
| Hardware Specification | No | The paper discusses 'Inference TFLOPS' and 'KV cache' for speeding up inference in Figure 5a and section B.3 respectively, but it does not specify any particular hardware components like GPU models (e.g., NVIDIA A100, RTX 2080 Ti) or CPU models used for the experiments. |
| Software Dependencies | No | Our implementation is consistent with the baseline methods by finetuning from Llama 3 8B (Dubey et al., 2024). The paper mentions the specific LLM (Llama 3 8B) used but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The equilibrium planner is finetuned for 6 iterations with a learning rate of 0.0002. [...] For the world model, we collect all interacting experiences between the planner and the environment, including plans and feedback, and finetune it for 5 epochs using the same learning rate of 0.0002. [...] A greedy LLM sampling strategy is used in later refinement steps until convergence. [...] The ratio of environmental interactions to world model calls is currently set to 1:1. |
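The 50:50 random division reported under "Dataset Splits" can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, seed, and use of Python's `random` module are assumptions.

```python
import random

def split_dataset(trajectories, seed=0):
    """Randomly divide a list of trajectories into train/test halves
    (50:50 ratio, as described in the paper). The seed is illustrative."""
    rng = random.Random(seed)
    shuffled = trajectories[:]      # copy so the input is left untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Example with dummy trajectory IDs
train, test = split_dataset(list(range(10)))
print(len(train), len(test))  # 5 5
```

The further partition of the test set into novel-scene / novel-task / novel-scene-and-task subsets depends on scene and task annotations not detailed in the excerpt, so it is omitted here.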