Model-Based Active Exploration

Authors: Pranav Shyam, Wojciech Jaśkowski, Faustino Gomez

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that in semi-random discrete environments where directed exploration is critical to make progress, MAX is at least an order of magnitude more efficient than strong baselines. MAX scales to high-dimensional continuous environments where it builds task-agnostic models that can be used for any downstream task.
Researcher Affiliation | Industry | Pranav Shyam, Wojciech Jaśkowski, Faustino Gomez (NNAISENSE, Lugano, Switzerland). Correspondence to: Pranav Shyam <pranav@nnaisense.com>.
Pseudocode | Yes | Algorithm 1 MODEL-BASED ACTIVE EXPLORATION (a hedged sketch of the ensemble-disagreement utility the algorithm optimizes appears below the table).
Open Source Code | Yes | Code: https://github.com/nnaisense/max
Open Datasets | Yes | A randomized version of the Chain environment (Figure 1a), designed to be hard to explore, as proposed by Osband et al. (2016).
Dataset Splits | No | The paper describes training and evaluation on the Chain, Ant Maze, and Half Cheetah environments, but it does not specify explicit train/validation/test splits (percentages or sample counts), nor does it cite predefined splits for these tasks.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models, memory, or cloud instance types used for running the experiments. It only references environments like MuJoCo.
Software Dependencies | No | The paper mentions software components such as MuJoCo, the DQN algorithm, and Soft Actor-Critic (SAC), but it does not specify version numbers for these or any other software dependencies used in the experiments.
Experiment Setup | Yes | For the chain environment, MAX used Monte-Carlo Tree Search to find open-loop exploration policies (see Appendix C for details). The hyper-parameters for both baseline methods were tuned with grid search. λ was fixed to 0.1 for all continuous environments. Exploration policies were regularly trained from scratch with SAC, with the utilities re-calculated using the latest models to avoid over-commitment. Models were probabilistic deep neural networks trained with a negative log-likelihood loss to predict next-state distributions as multivariate Gaussians with diagonal covariance matrices (a minimal sketch of such a model follows the table). Soft Actor-Critic (SAC; Haarnoja et al., 2018) was used to learn both pure exploration and task-specific policies.
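
The Pseudocode row above refers to Algorithm 1, which plans exploration by maximizing the disagreement among an ensemble of probabilistic dynamics models. The snippet below is a minimal Python/NumPy sketch of that disagreement signal for diagonal-Gaussian ensemble members, using a Jensen-Shannon-style divergence (mixture entropy minus mean member entropy) estimated by Monte-Carlo sampling. The function names and the sampling-based estimate are illustrative assumptions, not the paper's exact utility or code.

```python
# Sketch of an ensemble-disagreement exploration utility: how much the
# members of a probabilistic dynamics-model ensemble disagree about the
# next-state distribution for a given (state, action) pair.
import numpy as np


def gaussian_entropy(log_var):
    """Closed-form entropy of a diagonal Gaussian with the given log-variances."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e) + log_var, axis=-1)


def mixture_log_prob(x, means, log_vars):
    """Log-density of a uniform mixture of K diagonal Gaussians at points x."""
    # x: (N, D); means, log_vars: (K, D)
    diff = x[:, None, :] - means[None, :, :]                      # (N, K, D)
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_vars[None] + diff ** 2 / np.exp(log_vars[None]),
        axis=-1,
    )                                                             # (N, K)
    return np.logaddexp.reduce(log_comp, axis=1) - np.log(means.shape[0])


def js_disagreement(means, log_vars, n_samples=512, rng=None):
    """Utility of a transition: entropy of the ensemble mixture minus the
    average per-member entropy. Large values mean the models disagree,
    i.e. the transition is informative and worth exploring."""
    rng = np.random.default_rng() if rng is None else rng
    K, D = means.shape
    # Sample from the mixture: pick a member uniformly, then sample from it.
    idx = rng.integers(K, size=n_samples)
    samples = means[idx] + rng.standard_normal((n_samples, D)) * np.exp(0.5 * log_vars[idx])
    mixture_entropy = -np.mean(mixture_log_prob(samples, means, log_vars))
    return mixture_entropy - np.mean(gaussian_entropy(log_vars))


# Toy usage: three ensemble members that disagree about a 2-D next state.
means = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
log_vars = np.full((3, 2), np.log(0.1))
print(js_disagreement(means, log_vars))
```

In an exploration loop in the spirit of Algorithm 1, this utility would be evaluated on imagined transitions from the current ensemble and used as the reward for the SAC-trained exploration policy mentioned in the Experiment Setup row.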
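
The Experiment Setup row states that each model is a probabilistic deep neural network trained with a negative log-likelihood loss to output a diagonal-Gaussian next-state distribution. Below is a minimal PyTorch sketch of one such ensemble member; the layer sizes, the log-variance clamp, and all names are illustrative assumptions, not the paper's architecture or hyper-parameters.

```python
# Sketch of a probabilistic dynamics model outputting a diagonal-Gaussian
# next-state distribution, trained with the Gaussian negative log-likelihood.
import torch
import torch.nn as nn


class GaussianDynamicsModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.log_var_head = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        # Clamp the log-variance for numerical stability (assumed range).
        log_var = self.log_var_head(h).clamp(-10.0, 4.0)
        return mean, log_var


def nll_loss(mean, log_var, next_state):
    """Negative log-likelihood of the observed next state under the predicted
    diagonal Gaussian (the constant 0.5*log(2*pi) term is dropped)."""
    inv_var = torch.exp(-log_var)
    return 0.5 * ((next_state - mean) ** 2 * inv_var + log_var).sum(-1).mean()


# Toy usage on random tensors.
model = GaussianDynamicsModel(state_dim=8, action_dim=2)
s, a, s_next = torch.randn(32, 8), torch.randn(32, 2), torch.randn(32, 8)
mean, log_var = model(s, a)
loss = nll_loss(mean, log_var, s_next)
loss.backward()
```

Training several such models on bootstrapped or independently shuffled replay data would yield the ensemble whose disagreement drives the utility sketched above.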