Embodied CoT Distillation From LLM To Off-the-shelf Agents

Authors: Wonje Choi, Woo Kyung Kim, Minjong Yoo, Honguk Woo

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments with the ALFRED benchmark demonstrate that DEDER surpasses leading language planning and distillation approaches, indicating the applicability and efficiency of sLM-based embodied policies derived through DEDER." (Section 5, Evaluation)
Researcher Affiliation | Academia | "Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Republic of Korea. Correspondence to: Honguk Woo <hwoo@skku.edu>."
Pseudocode | Yes | "Algorithm 1 Policy Distillation Rationale Dataset DRtn" and "Algorithm 2 Rationale Dataset Construction"
Open Source Code | No | The paper refers to open-source projects for baselines and related work, but there is no explicit statement or link indicating that the code for DEDER (the work described in this paper) is open source or publicly available.
Open Datasets | Yes | "For evaluation, we use the ALFRED (Shridhar et al., 2020) environment."
Dataset Splits | No | The paper defines evaluation task categories (Train, Seen, Unseen Spatial, Unseen Environment) and mentions 312 expert trajectories used for training, but it does not state counts or percentages for a separate validation split of the training data.
Hardware Specification | Yes | "Our framework is implemented using Python v3.9 and PyTorch v2.0.1, trained on a system of an Intel(R) Core(TM) i9-10980XE processor and an NVIDIA RTX A6000 GPU. We measured inference times on several off-the-shelf devices such as RTX 3090, 3050, and 2080 Ti GPUs."
Software Dependencies | Yes | "Our framework is implemented using Python v3.9 and PyTorch v2.0.1." The paper also uses various LMs such as PaLM, LLaMA, and GPT2-large, along with the sentence-embedding models paraphrase-MiniLM-L6-v2 (LLM-Planner) and stsb-roberta-large (ZSP); see the loading sketch below the table.
Experiment Setup | Yes | "We collect 312 expert trajectories in a variety of tasks varying the starting positions of the agent and objects as well as the indoor scenes." Hyperparameters: train epochs 100, batch size 1, optimizer SGD, learning rate 5e-5, temperature 0.1, beam size 3, top-k 5, top-p 0.3, maximum new tokens 40, scaling factor α 0.5 (see the training and decoding sketches below the table).
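
For the Software Dependencies row, a minimal sketch of how the two named sentence encoders can be loaded with the sentence-transformers library. The model IDs are the ones quoted above; the retrieval-style usage shown is an illustrative assumption, not the paper's actual pipeline.

```python
# Sketch: load the sentence encoders cited for the baselines.
# The model IDs come from the Software Dependencies row; how the
# baselines use them internally is not shown in this report.
from sentence_transformers import SentenceTransformer, util

# paraphrase-MiniLM-L6-v2 is the encoder attributed to LLM-Planner.
llm_planner_encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")
# stsb-roberta-large is the encoder attributed to ZSP.
zsp_encoder = SentenceTransformer("stsb-roberta-large")

# Hypothetical example: score an instruction against candidate skill
# descriptions, the typical retrieval step such encoders serve.
instruction = "put a clean mug on the coffee table"
skills = ["pick up mug", "wash mug", "place mug on table"]
emb_i = llm_planner_encoder.encode(instruction, convert_to_tensor=True)
emb_s = llm_planner_encoder.encode(skills, convert_to_tensor=True)
print(util.cos_sim(emb_i, emb_s))
```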
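
For the Experiment Setup row, a minimal PyTorch training-loop sketch that wires in the reported optimization values (100 epochs, batch size 1, SGD, learning rate 5e-5). The model and data below are placeholders; only the numeric settings and the 312-trajectory count come from the paper.

```python
# Sketch: a training loop using the reported hyperparameters.
# DistilledPolicy stand-in and feature shapes are hypothetical.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

EPOCHS = 100          # train epochs 100
BATCH_SIZE = 1        # batch size 1
LEARNING_RATE = 5e-5  # learning rate 5e-5

model = nn.Linear(512, 32)               # hypothetical policy head
inputs = torch.randn(312, 512)           # 312 expert trajectories (shape assumed)
targets = torch.randint(0, 32, (312,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=BATCH_SIZE)

optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```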
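
The remaining values in that row (temperature 0.1, beam size 3, top-k 5, top-p 0.3, maximum new tokens 40) read as decoding hyperparameters; the sketch below maps them onto HuggingFace generate() arguments using GPT2-large, one of the LMs the paper names. The call itself is an assumption for illustration, not the paper's code.

```python
# Sketch: the reported decoding hyperparameters expressed as
# HuggingFace `generate()` arguments; the prompt is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

prompt = "Task: put a clean mug on the coffee table. Plan:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # required for temperature/top-k/top-p to apply
    temperature=0.1,
    num_beams=3,         # beam size 3
    top_k=5,
    top_p=0.3,
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```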