Bayesian Nonparametrics for Offline Skill Discovery
Authors: Valentin Villecroze, Harry Braviner, Panteha Naderian, Chris Maddison, Gabriel Loaiza-Ganem
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The goal of our experiments is twofold: to show that our options framework learns more useful skills than DDO and CompILE, and also that the nonparametric extensions of our own model and CompILE (which circumvent the need to specify K) match the performance of their respective parametric versions with K tuned as a hyperparameter. The former goal highlights the usefulness of incorporating variational inference advances to offline option learning, and the latter highlights the benefits of using Bayesian nonparametrics for skill discovery. All experimental details are given in Appendix C. |
| Researcher Affiliation | Collaboration | (1) Layer 6 AI, Toronto, Canada; (2) University of Toronto, Toronto, Canada; (3) Vector Institute, Toronto, Canada. |
| Pseudocode | Yes | Algorithm 1 Trajectory generation with options. ... Algorithm 2 Trajectory generation with CompILE. (A hedged sketch of option-based trajectory generation appears below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/layer6ai-labs/BNPO. |
| Open Datasets | Yes | We further test our model on several games from the Atari learning environment (Bellemare et al., 2013). For each game, we use expert trajectories generated by a trained Ape-X agent (Horgan et al., 2018; Such et al., 2019). |
| Dataset Splits | No | The paper does not explicitly specify validation dataset splits (e.g., percentages, sample counts, or explicit mention of a validation set). |
| Hardware Specification | No | The paper does not specify any particular hardware details such as specific GPU models, CPU models, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper lists several software dependencies such as Python, Matplotlib, TensorFlow, PyTorch, NumPy, and Stable-Baselines3, but it does not provide specific version numbers for these packages, which are needed for a reproducible setup. |
| Experiment Setup | Yes | The options sub-policies and termination functions consist of MLPs with two hidden layers of 16 units separated by a ReLU activation and followed by a Softmax activation. ... We use a learning rate of 0.005 with the Adam optimizer (Kingma & Ba, 2014) and a batch size of 128. The Gumbel-Softmax (GS) temperature parameter is initialized at 1 and annealed by a factor of 0.995 each epoch. λent is initialized at 5 and also annealed by a factor of 0.995 each epoch. (See the PyTorch sketch below the table.) |
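
For reference alongside the Pseudocode row, here is a minimal Python sketch of what Algorithm 1 ("Trajectory generation with options") plausibly does under the standard options framework; the `env`, `high_level_policy`, `sub_policies`, and `terminations` interfaces are our assumptions, not the authors' API.

```python
import numpy as np

def generate_trajectory(env, high_level_policy, sub_policies, terminations, max_steps=500):
    """Roll out one trajectory under the options framework: the high-level
    policy picks an option, whose sub-policy acts until the option's
    termination function fires, at which point a new option is sampled."""
    state = env.reset()
    option = high_level_policy(state)              # choose the initial option
    trajectory = []
    for _ in range(max_steps):
        action = sub_policies[option](state)       # current option's sub-policy acts
        next_state, reward, done = env.step(action)
        trajectory.append((state, option, action, reward))
        # Bernoulli termination check: does the current option end here?
        if np.random.rand() < terminations[option](next_state):
            option = high_level_policy(next_state)
        state = next_state
        if done:
            break
    return trajectory
```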
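
The Experiment Setup row translates fairly directly into code. Below is a minimal PyTorch sketch of that configuration; the input/output dimensions, number of options, number of epochs, and the exact wiring of the two hidden layers are placeholder assumptions, and the training-loop body is elided.

```python
import torch
import torch.nn as nn

def make_policy_net(obs_dim: int, out_dim: int) -> nn.Sequential:
    # Two 16-unit hidden layers separated by a ReLU, followed by a Softmax,
    # per the quoted setup; the output-layer wiring is our reading of it.
    return nn.Sequential(
        nn.Linear(obs_dim, 16),
        nn.ReLU(),
        nn.Linear(16, 16),
        nn.Linear(16, out_dim),
        nn.Softmax(dim=-1),
    )

obs_dim, n_actions, n_options = 8, 4, 5            # placeholder sizes
sub_policies = [make_policy_net(obs_dim, n_actions) for _ in range(n_options)]
terminations = [make_policy_net(obs_dim, 2) for _ in range(n_options)]
params = [p for net in sub_policies + terminations for p in net.parameters()]
optimizer = torch.optim.Adam(params, lr=0.005)     # reported optimizer and learning rate
batch_size = 128                                   # reported batch size

gs_temperature, lambda_ent = 1.0, 5.0              # reported initial values
for epoch in range(100):                           # epoch count not reported
    # ... one epoch of training on batches of size 128 would go here ...
    gs_temperature *= 0.995                        # anneal GS temperature each epoch
    lambda_ent *= 0.995                            # anneal entropy weight each epoch
```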