Bayesian Nonparametrics for Offline Skill Discovery

Authors: Valentin Villecroze, Harry Braviner, Panteha Naderian, Chris Maddison, Gabriel Loaiza-Ganem

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The goal of our experiments is twofold: to show that our options framework learns more useful skills than DDO and CompILE, and also that the nonparametric extensions of our own model and CompILE (which circumvent the need to specify K) match the performance of their respective parametric versions with K tuned as a hyperparameter. The former goal highlights the usefulness of incorporating variational inference advances to offline option learning, and the latter highlights the benefits of using Bayesian nonparametrics for skill discovery. All experimental details are given in Appendix C.
Researcher Affiliation | Collaboration | (1) Layer 6 AI, Toronto, Canada; (2) University of Toronto, Toronto, Canada; (3) Vector Institute, Toronto, Canada.
Pseudocode | Yes | Algorithm 1: Trajectory generation with options. ... Algorithm 2: Trajectory generation with CompILE. (An illustrative options rollout is sketched after the table.)
Open Source Code | Yes | Our code is available at https://github.com/layer6ai-labs/BNPO.
Open Datasets | Yes | We further test our model on several games from the Atari learning environment (Bellemare et al., 2013). For each game, we use expert trajectories generated by a trained Ape-X agent (Horgan et al., 2018; Such et al., 2019). (A trajectory-collection sketch follows the table.)
Dataset Splits | No | The paper does not explicitly specify validation dataset splits (e.g., percentages, sample counts, or explicit mention of a validation set).
Hardware Specification | No | The paper does not specify any particular hardware details such as specific GPU models, CPU models, or memory amounts used for the experiments.
Software Dependencies | No | The paper lists several software dependencies, such as Python, Matplotlib, TensorFlow, PyTorch, NumPy, and Stable-Baselines3, but it does not provide specific version numbers for these packages, which is necessary for a reproducible setup. (A version-recording snippet follows the table.)
Experiment Setup | Yes | The options sub-policies and termination functions consist of MLPs with two hidden layers of 16 units separated by a ReLU activation and followed by a Softmax activation. ... We use a learning rate of 0.005 with the Adam optimizer (Kingma & Ba, 2014) and a batch size of 128. The Gumbel-Softmax (GS) temperature parameter is initialized at 1 and annealed by a factor of 0.995 each epoch. λ_ent is initialized at 5 and also annealed by a factor of 0.995 each epoch. (A configuration sketch follows the table.)