Do's and Don'ts: Learning Desirable Skills with Instruction Videos

Authors: Hyunseung Kim, Byungkun Lee, Hojoon Lee, Dongyoon Hwang, Donghu Kim, Jaegul Choo

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate DoDont, we conduct experiments on various continuous control tasks that require complex locomotion (e.g., Cheetah and Quadruped [45]) or precise manipulation (e.g., Kitchen [17]). Our results show that with fewer than eight instruction videos, DoDont effectively learns complex locomotion skills (e.g., quadruped running), which are challenging to acquire with standard unsupervised skill discovery (USD) algorithms [39].
Researcher Affiliation | Collaboration | Hyunseung Kim¹,², Byungkun Lee¹, Hojoon Lee¹, Dongyoon Hwang¹, Donghu Kim¹, Jaegul Choo¹ (¹KAIST, ²KRAFTON); {mynsng,byungkun.lee,joonleesky,godnpeter,quagmire,jchoo}@kaist.ac.kr
Pseudocode | Yes | We provide the pseudocode of DoDont in Algorithm 1. Furthermore, our instruction network can be applied to zero-shot offline RL to learn diverse behaviors while prioritizing desirable behaviors within the offline unlabeled dataset. Detailed explanations and experiments are presented in Appendix A. Algorithm 1: Do's and Don'ts (Online); Algorithm 2: Do's and Don'ts (Offline). (A hypothetical sketch of the instruction-network reward weighting appears after the table.)
Open Source Code | Yes | Code and videos are available at https://mynsng.github.io/dodont/
Open Datasets | Yes | For offline zero-shot RL, we use four environments (Walker, Cheetah, Quadruped, and Jaco) and two different ExORL datasets [52] (APS [31], RND [8]). We followed the experimental protocol outlined in the HILP paper [37], where detailed information is available in Appendix D.
Dataset Splits | No | The paper uses various environments and existing datasets but does not explicitly provide training/validation/test dataset splits (e.g., in percentages or sample counts) for its experiments, particularly for the online reinforcement learning components, where data is generated through interaction.
Hardware Specification | Yes | Our experiments run on NVIDIA RTX 3090 GPUs, with each run taking no more than 28 hours.
Software Dependencies | Yes | The Adam optimizer [25] is employed with a learning rate of 1 × 10⁻⁴ and a batch size of 1024.
Experiment Setup | Yes | The Adam optimizer [25] is employed with a learning rate of 1 × 10⁻⁴ and a batch size of 1024. Additionally, DoDont introduces only one extra hyperparameter, which is the coefficient for the instruction network. The complete list of hyperparameters can be found in Tables 2 and 3. (A minimal optimizer configuration sketch appears after the table.)
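
The pseudocode referenced in the Pseudocode row (Algorithms 1 and 2) is not reproduced on this page. As a rough illustration only, the sketch below shows one plausible way an instruction network trained on "do"/"don't" videos could reweight an unsupervised skill-discovery reward; every name here (InstructionNet, weighted_intrinsic_reward, obs_dim) is hypothetical, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class InstructionNet(nn.Module):
    """Hypothetical binary classifier over state transitions: trained to output
    values near 1 for transitions from "do" videos and near 0 for "don't" videos."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # Score for the transition (s, s_next) being "desirable".
        return torch.sigmoid(self.net(torch.cat([s, s_next], dim=-1)))


def weighted_intrinsic_reward(instruction_net: InstructionNet,
                              skill_reward: torch.Tensor,
                              s: torch.Tensor,
                              s_next: torch.Tensor) -> torch.Tensor:
    """Scale an unsupervised skill-discovery reward by the instruction network's
    score so that desirable ("do") behaviors are prioritized during training."""
    with torch.no_grad():
        w = instruction_net(s, s_next).squeeze(-1)
    return w * skill_reward
```

In this sketch the instruction network acts as a frozen transition scorer at reward-computation time; whether it is trained jointly with the skill-discovery agent or in a separate phase is not specified in the excerpts above.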
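
The training settings quoted in the Software Dependencies and Experiment Setup rows (Adam, learning rate 1 × 10⁻⁴, batch size 1024) translate directly into a standard optimizer configuration. The snippet below is a minimal sketch assuming a PyTorch training loop; the model and dataset are throwaway placeholders and not from the paper.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

# Placeholder network and data; only the optimizer choice (Adam), the
# 1e-4 learning rate, and the 1024 batch size come from the paper.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))
dataset = TensorDataset(torch.randn(4096, 64), torch.randn(4096, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(dataset, batch_size=1024, shuffle=True)

for x, y in loader:
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```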