Do's and Don'ts: Learning Desirable Skills with Instruction Videos
Authors: Hyunseung Kim, Byungkun Lee, Hojoon Lee, Dongyoon Hwang, Donghu Kim, Jaegul Choo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate DoDont, we conduct experiments on various continuous control tasks that require complex locomotion (e.g., Cheetah and Quadruped [45]) or precise manipulation (e.g., Kitchen [17]). Our results show that with fewer than eight instruction videos, DoDont effectively learns complex locomotion skills (e.g., running quadruped), which are challenging to acquire with standard USD algorithms [39]. |
| Researcher Affiliation | Collaboration | Hyunseung Kim¹,², Byungkun Lee¹, Hojoon Lee¹, Dongyoon Hwang¹, Donghu Kim¹, Jaegul Choo¹ (¹KAIST, ²KRAFTON) {mynsng,byungkun.lee,joonleesky,godnpeter,quagmire,jchoo}@kaist.ac.kr |
| Pseudocode | Yes | We provide the pseudocode of DoDont in Algorithm 1. Furthermore, our instruction network can be applied to zero-shot offline RL to learn diverse behaviors while prioritizing desirable behaviors within the offline unlabeled dataset. Detailed explanations and experiments are presented in Appendix A. Algorithm 1: Do's and Don'ts (Online); Algorithm 2: Do's and Don'ts (Offline) |
| Open Source Code | Yes | Code and videos are available at https://mynsng.github.io/dodont/ |
| Open Datasets | Yes | For offline zero-shot RL, we use four environments (Walker, Cheetah, Quadruped, and Jaco) and two different ExORL datasets [52] (APS [31], RND [8]). We followed the experimental protocol outlined in the HILP paper [37], where detailed information is available in Appendix D. |
| Dataset Splits | No | The paper uses various environments and existing datasets but does not explicitly provide training/validation/test dataset splits (e.g., in percentages or sample counts) for its experiments, particularly for the online reinforcement learning components where data is generated through interaction. |
| Hardware Specification | Yes | Our experiments run on NVIDIA RTX 3090 GPUs, with each run taking no more than 28 hours. |
| Software Dependencies | Yes | The Adam optimizer [25] is employed with a learning rate of 1 × 10⁻⁴ and a batch size of 1024. |
| Experiment Setup | Yes | The Adam optimizer [25] is employed with a learning rate of 1 × 10⁻⁴ and a batch size of 1024. Additionally, DoDont introduces only one extra hyperparameter, which is the coefficient for the instruction network. The complete list of hyperparameters can be found in Table 2, 3. |
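
The optimizer and batch-size details quoted in the rows above are concrete enough to sketch the reported training configuration. The snippet below is a minimal PyTorch sketch under the assumption that a standard policy network is trained with these settings; the network dimensions and the `instruction_coef` placeholder (standing in for the paper's single extra instruction-network coefficient, whose actual value appears in Tables 2-3, not here) are illustrative assumptions, not the authors' released code.

```python
import torch

# Hedged sketch of the reported training configuration.
# Only learning_rate and batch_size are quoted in the table above;
# everything else below is an illustrative placeholder.
config = {
    "learning_rate": 1e-4,    # Adam learning rate reported in the paper
    "batch_size": 1024,       # batch size reported in the paper
    "instruction_coef": 1.0,  # placeholder for DoDont's extra hyperparameter
}

# Example: attaching the reported optimizer settings to an arbitrary policy network
# (input/output sizes here are arbitrary, chosen only to make the sketch runnable).
policy = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 8),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=config["learning_rate"])
```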