LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents
Authors: Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. |
| Researcher Affiliation | Collaboration | Jae-Woo Choi1, Youngwoo Yoon1, Hyobin Ong1,2, Jaehong Kim1, Minsu Jang1,2; 1Electronics and Telecommunications Research Institute, 2University of Science and Technology; {jwchoi0717,youngwoo,ohnghb,jhkim504,minsu}@etri.re.kr |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | 4) public release of benchmark code and extended dataset (WAH-NL); they are available at https://github.com/lbaa2022/LLMTaskPlanning. |
| Open Datasets | Yes | 1) ALFRED dataset (Shridhar et al., 2020) with AI2-THOR simulator (Kolve et al., 2017), and 2) our extension of Watch-And-Help (WAH) dataset (Puig et al., 2021), WAH-NL, paired with VirtualHome simulator (Puig et al., 2018). [...] public release of benchmark code and extended dataset (WAH-NL); they are available at https://github.com/lbaa2022/LLMTaskPlanning. |
| Dataset Splits | Yes | The ALFRED dataset consists of three sets: train, valid-seen, and valid-unseen. The valid-seen was used to evaluate planning performance; the train set was only used to take examples to construct prompts. |
| Hardware Specification | Yes | Most of the models were run on a single NVIDIA A100 80GB GPU, while we used two A100 GPUs and three RTX 6000 GPUs for inference of larger models, such as OPT 66B and LLaMA 2 70B, with model parallelism. (See the model-parallel loading sketch below the table.) |
| Software Dependencies | No | The paper mentions software like Hugging Face's Transformers library, OpenAI's GPT API, and the Guidance library, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | The default setup is to include six examples in ALFRED and five examples in WAH-NL (one example per task type). [...] We finetuned LLaMA 1 models using LoRA (Hu et al., 2021)... with the hyper-parameters set as follows: the rank of LoRA modules was set to 16 (reduced to 8 for the 30B model due to GPU memory constraints), the dropout rate was 0.1, and the number of epochs was 5. (See the LoRA configuration sketch below the table.) |
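
The Hardware Specification row reports multi-GPU inference with model parallelism for the 66B and 70B models. A minimal sketch of how such a model could be sharded across GPUs follows, assuming the Transformers/accelerate `device_map="auto"` mechanism; the paper does not state which parallelism implementation was used, and the checkpoint name is illustrative.

```python
# Sketch: model-parallel inference across multiple GPUs (assumption:
# Transformers + accelerate device_map sharding; not confirmed by the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # illustrative (gated) checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shard layers across all visible GPUs
    torch_dtype=torch.float16,  # half precision to fit 70B on 2x A100 80GB
)
```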
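
The Experiment Setup row quotes the LoRA finetuning hyper-parameters. Below is a minimal configuration sketch using Hugging Face's peft library; only the rank (16, or 8 for the 30B model), the dropout rate (0.1), and the epoch count (5) come from the paper, while the checkpoint name and remaining settings are assumptions.

```python
# Sketch: LoRA finetuning config per the quoted hyper-parameters
# (rank 16, dropout 0.1, 5 epochs); everything else is assumed.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b")  # illustrative checkpoint

lora_config = LoraConfig(
    r=16,               # LoRA rank from the paper (reduced to 8 for the 30B model)
    lora_dropout=0.1,   # dropout rate from the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Training would then run for 5 epochs, e.g. via transformers.Trainer with
# TrainingArguments(num_train_epochs=5, ...); optimizer settings are not
# specified in the excerpt above.
```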