LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Authors: Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. |
| Researcher Affiliation | Collaboration | Jae-Woo Choi 1, Youngwoo Yoon 1, Hyobin Ong 1,2, Jaehong Kim 1, Minsu Jang 1,2 — 1 Electronics and Telecommunications Research Institute, 2 University of Science and Technology. {jwchoi0717,youngwoo,ohnghb,jhkim504,minsu}@etri.re.kr |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | 4) public release of benchmark code and extended dataset (WAH-NL); they are available at https://github.com/lbaa2022/LLMTaskPlanning. |
| Open Datasets | Yes | 1) ALFRED dataset (Shridhar et al., 2020) with AI2-THOR simulator (Kolve et al., 2017), and 2) our extension of the Watch-And-Help (WAH) dataset (Puig et al., 2021), WAH-NL, paired with the VirtualHome simulator (Puig et al., 2018). [...] public release of benchmark code and extended dataset (WAH-NL); they are available at https://github.com/lbaa2022/LLMTaskPlanning. |
| Dataset Splits | Yes | The ALFRED dataset consists of three sets: train, valid-seen, and valid-unseen. The valid-seen set was used to evaluate planning performance; the train set was used only to draw examples for constructing prompts. |
| Hardware Specification | Yes | Most of the models were run on a single NVIDIA A100 80GB GPU, while two A100 GPUs and three RTX 6000 GPUs were used for inference of larger models, such as OPT 66B and LLaMA 2 70B, with model parallelism. |
| Software Dependencies | No | The paper mentions software such as Hugging Face's Transformers library, OpenAI's GPT API, and the Guidance library, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | The default setup is to include six examples in ALFRED and five examples in WAH-NL (one example per task type). [...] We finetuned LLaMA 1 models using LoRA (Hu et al., 2021)... with the hyper-parameters set as follows: the rank of the LoRA modules was 16 (reduced to 8 for the 30B model due to GPU memory constraints), the dropout rate was 0.1, and the number of epochs was 5. |