Bayesian Nonparametrics for Offline Skill Discovery

Authors: Valentin Villecroze, Harry Braviner, Panteha Naderian, Chris Maddison, Gabriel Loaiza-Ganem

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The goal of our experiments is twofold: to show that our options framework learns more useful skills than DDO and CompILE, and also that the nonparametric extensions of our own model and CompILE (which circumvent the need to specify K) match the performance of their respective parametric versions with K tuned as a hyperparameter. The former goal highlights the usefulness of incorporating variational inference advances to offline option learning, and the latter highlights the benefits of using Bayesian nonparametrics for skill discovery. All experimental details are given in Appendix C.
Researcher Affiliation | Collaboration | (1) Layer 6 AI, Toronto, Canada; (2) University of Toronto, Toronto, Canada; (3) Vector Institute, Toronto, Canada.
Pseudocode | Yes | Algorithm 1: Trajectory generation with options. ... Algorithm 2: Trajectory generation with CompILE. (An illustrative options rollout is sketched after the table.)
Open Source Code | Yes | Our code is available at https://github.com/layer6ai-labs/BNPO.
Open Datasets | Yes | We further test our model on several games from the Atari learning environment (Bellemare et al., 2013). For each game, we use expert trajectories generated by a trained Ape-X agent (Horgan et al., 2018; Such et al., 2019). (A trajectory-collection sketch follows the table.)
Dataset Splits | No | The paper does not explicitly specify validation dataset splits (e.g., percentages, sample counts, or explicit mention of a validation set).
Hardware Specification | No | The paper does not specify any particular hardware details such as specific GPU models, CPU models, or memory amounts used for the experiments.
Software Dependencies | No | The paper lists several software dependencies, such as Python, Matplotlib, TensorFlow, PyTorch, NumPy, and Stable-Baselines3, but it does not provide specific version numbers for these packages, which is necessary for a reproducible setup. (A version-recording snippet follows the table.)
Experiment Setup | Yes | The options sub-policies and termination functions consist of MLPs with two hidden layers of 16 units separated by a ReLU activation and followed by a Softmax activation. ... We use a learning rate of 0.005 with the Adam optimizer (Kingma & Ba, 2014) and a batch size of 128. The Gumbel-Softmax (GS) temperature parameter is initialized at 1 and annealed by a factor of 0.995 each epoch. λ_ent is initialized at 5 and also annealed by a factor of 0.995 each epoch. (A configuration sketch follows the table.)