QueST: Self-Supervised Skill Abstractions for Learning Continuous Control

Authors: Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, Animesh Garg

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare to state-of-the-art imitation learning and LVM baselines and see that QueST's architecture leads to strong performance on several multitask and few-shot learning benchmarks. Further results and videos are available at https://quest-model.github.io.
Researcher Affiliation | Collaboration | Atharva Mete1, Haotian Xue1, Albert Wilcox1, Yongxin Chen1,2, Animesh Garg1,2; 1Georgia Institute of Technology, 2NVIDIA
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled in the paper.
Open Source Code | Yes | We include codes in the supplementary materials.
Open Datasets | Yes | LIBERO [38] is a lifelong learning benchmark featuring several task suites consisting of a variety of language-labeled rigid- and articulated-body manipulation tasks. Specifically, we evaluate on the LIBERO-90 suite, which consists of 90 manipulation tasks, and the LIBERO-LONG suite, which consists of 10 long-horizon tasks composed of two tasks from the LIBERO-90 suite. Meta-World [70] features a wide range of manipulation tasks designed to test few-shot learning algorithms. We use the Meta-Learning 45 (ML45) suite, which consists of 45 training tasks and 5 difficult held-out tasks that are structurally similar to the training tasks.
Dataset Splits | Yes | For LIBERO-90, the learner receives 50 expert demonstrations per task from the author-provided dataset. For ML45, we use the scripted policies provided in the official Metaworld codebase to collect 100 demonstrations per task. In the few-shot setting, we take the pretrained model from Section 5.2 and test its 5-shot performance on unseen tasks from LIBERO-LONG and the held-out set in ML45. We sample only five demonstrations for each task, generate the skill tokens using the pretrained encoder, and use them to finetune the skill prior and the decoder as described in Section 4.2. (A hedged sketch of this adaptation loop is given after the table.)
Hardware Specification | Yes | For all our experiments we use a server with 8 Nvidia RTX 1080Ti GPUs (10GB memory each), and all our models easily fit on one GPU for training.
Software Dependencies | No | The models are implemented in PyTorch. For transformer blocks, we used the transformers library from Hugging Face (https://huggingface.co/docs/transformers/) with appropriate masking to ensure causality. (No specific version numbers for PyTorch or the Hugging Face transformers library are provided.) A minimal configuration sketch using this library follows after the table.
Experiment Setup | Yes | We present hyperparameters in the following tables:

Table 3: Stage 1 Parameters
  Parameter             Value
  encoder dim           256
  decoder dim           256
  sequence length (T)   16/32
  encoder heads         4
  encoder layers        2
  decoder heads         4
  decoder layers        4
  attention dropout     0.1
  fsq level             [8, 5, 5, 5]
  conv layers           3
  downsampling factor   2/4

Table 4: Stage 2 Parameters
  Parameter             Value
  vocab size            1000
  block size (n)        8
  number of layers      6
  number of heads       6
  embedding dimension   384
  attention dropout     0.1
  beam size             5
  temperature           1.0
  decoder loss scale    0/10
  execution horizon (Ta) 8
  observation history   1
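
The 5-shot adaptation described in the Dataset Splits row, where the pretrained skill encoder is kept frozen while the skill prior and decoder are finetuned on five demonstrations per task, can be outlined as the PyTorch sketch below. The module names (skill_encoder, skill_prior, skill_decoder), tensor shapes, and loss weighting are illustrative assumptions, not the authors' implementation.

import torch
from torch.utils.data import DataLoader

def fewshot_finetune(skill_encoder, skill_prior, skill_decoder,
                     fewshot_demos, epochs=100, lr=1e-4):
    """Finetune the skill prior and decoder on a handful of demonstrations,
    keeping the pretrained skill encoder frozen (hypothetical interfaces)."""
    skill_encoder.eval()                              # encoder stays frozen
    for p in skill_encoder.parameters():
        p.requires_grad_(False)

    params = list(skill_prior.parameters()) + list(skill_decoder.parameters())
    optim = torch.optim.AdamW(params, lr=lr)
    loader = DataLoader(fewshot_demos, batch_size=16, shuffle=True)

    for _ in range(epochs):
        for obs, actions in loader:                   # (B, T, obs_dim), (B, T, act_dim)
            with torch.no_grad():
                skill_tokens = skill_encoder(actions)  # (B, n) discrete skill codes

            # Autoregressive prior predicts the next skill token (teacher forcing).
            prior_logits = skill_prior(obs, skill_tokens[:, :-1])
            prior_loss = torch.nn.functional.cross_entropy(
                prior_logits.reshape(-1, prior_logits.size(-1)),
                skill_tokens[:, 1:].reshape(-1))

            # Decoder reconstructs the action chunk from the skill tokens.
            pred_actions = skill_decoder(skill_tokens, obs)
            decoder_loss = torch.nn.functional.l1_loss(pred_actions, actions)

            loss = prior_loss + 10.0 * decoder_loss   # assumed weighting; cf. decoder loss scale 0/10
            optim.zero_grad()
            loss.backward()
            optim.step()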
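
As a concrete illustration of the Software Dependencies row, the sketch below shows one way a causally masked transformer from the Hugging Face transformers library could be configured with the Stage 2 hyperparameters of Table 4. This is a minimal sketch, not the authors' architecture; in particular, observation conditioning is omitted and the context length (n_positions) and start token are assumptions.

import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=1000,   # matches the FSQ codebook size
    n_positions=64,    # assumed upper bound on token context length
    n_embd=384,        # embedding dimension (Table 4)
    n_layer=6,         # number of layers (Table 4)
    n_head=6,          # number of heads (Table 4)
    attn_pdrop=0.1,    # attention dropout (Table 4)
)
skill_prior = GPT2LMHeadModel(config)  # GPT-2 blocks are causally masked by default

# Autoregressively decode a block of 8 skill tokens with beam search (beam size 5).
context = torch.zeros(1, 1, dtype=torch.long)  # placeholder start token (assumption)
skill_tokens = skill_prior.generate(
    context, max_new_tokens=8, num_beams=5, do_sample=False, pad_token_id=0)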
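
Finally, a quick cross-check relates the two tables: the product of the FSQ levels gives the Stage 2 skill-token vocabulary, and, under the assumption that block size equals the action-chunk length divided by the total downsampling factor, both the 16/2 and 32/4 variants yield 8 tokens per chunk.

from math import prod

fsq_levels = [8, 5, 5, 5]
print(prod(fsq_levels))        # 1000, matching the Stage 2 vocab size

# Assumption: block size = sequence length / total downsampling factor.
for seq_len, down in [(16, 2), (32, 4)]:
    print(seq_len // down)     # 8 in both variants, matching block size (n) = 8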