Model-Based Active Exploration

Authors: Pranav Shyam, Wojciech Jaśkowski, Faustino Gomez

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that in semi-random discrete environments where directed exploration is critical to make progress, MAX is at least an order of magnitude more efficient than strong baselines. MAX scales to high-dimensional continuous environments where it builds task-agnostic models that can be used for any downstream task.
Researcher Affiliation | Industry | Pranav Shyam, Wojciech Jaśkowski, Faustino Gomez (NNAISENSE, Lugano, Switzerland). Correspondence to: Pranav Shyam <pranav@nnaisense.com>.
Pseudocode | Yes | Algorithm 1 MODEL-BASED ACTIVE EXPLORATION (a hedged sketch of the ensemble-disagreement utility the algorithm optimizes appears below the table).
Open Source Code | Yes | Code: https://github.com/nnaisense/max
Open Datasets | Yes | A randomized version of the Chain environment (Figure 1a), designed to be hard to explore, as proposed by Osband et al. (2016).
Dataset Splits | No | The paper describes training and evaluation on the Chain, Ant Maze, and Half Cheetah environments, but it does not specify explicit train/validation/test splits (percentages or sample counts), nor does it cite predefined splits for these tasks.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models, memory, or cloud instance types used for running the experiments. It only references environments like MuJoCo.
Software Dependencies | No | The paper mentions software components such as MuJoCo, the DQN algorithm, and Soft Actor-Critic (SAC), but it does not specify version numbers for these or any other software dependencies used in the experiments.
Experiment Setup | Yes | For the chain environment, MAX used Monte-Carlo Tree Search to find open-loop exploration policies (see Appendix C for details). The hyper-parameters for both baseline methods were tuned with grid search. λ was fixed to 0.1 for all continuous environments. Exploration policies were regularly trained from scratch with SAC, with the utilities re-calculated using the latest models to avoid over-commitment. Models were probabilistic deep neural networks trained with a negative log-likelihood loss to predict next-state distributions as multivariate Gaussians with diagonal covariance matrices (a minimal sketch of such a model follows the table). Soft Actor-Critic (SAC; Haarnoja et al., 2018) was used to learn both pure exploration and task-specific policies.
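
The Pseudocode row above refers to Algorithm 1, which plans exploration by maximizing the disagreement among an ensemble of probabilistic dynamics models. The snippet below is a minimal Python/NumPy sketch of that disagreement signal for diagonal-Gaussian ensemble members, using a Jensen-Shannon-style divergence (mixture entropy minus mean member entropy) estimated by Monte-Carlo sampling. The function names and the sampling-based estimate are illustrative assumptions, not the paper's exact utility or code.

```python
# Sketch of an ensemble-disagreement exploration utility: how much the
# members of a probabilistic dynamics-model ensemble disagree about the
# next-state distribution for a given (state, action) pair.
import numpy as np


def gaussian_entropy(log_var):
    """Closed-form entropy of a diagonal Gaussian with the given log-variances."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e) + log_var, axis=-1)


def mixture_log_prob(x, means, log_vars):
    """Log-density of a uniform mixture of K diagonal Gaussians at points x."""
    # x: (N, D); means, log_vars: (K, D)
    diff = x[:, None, :] - means[None, :, :]                      # (N, K, D)
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_vars[None] + diff ** 2 / np.exp(log_vars[None]),
        axis=-1,
    )                                                             # (N, K)
    return np.logaddexp.reduce(log_comp, axis=1) - np.log(means.shape[0])


def js_disagreement(means, log_vars, n_samples=512, rng=None):
    """Utility of a transition: entropy of the ensemble mixture minus the
    average per-member entropy. Large values mean the models disagree,
    i.e. the transition is informative and worth exploring."""
    rng = np.random.default_rng() if rng is None else rng
    K, D = means.shape
    # Sample from the mixture: pick a member uniformly, then sample from it.
    idx = rng.integers(K, size=n_samples)
    samples = means[idx] + rng.standard_normal((n_samples, D)) * np.exp(0.5 * log_vars[idx])
    mixture_entropy = -np.mean(mixture_log_prob(samples, means, log_vars))
    return mixture_entropy - np.mean(gaussian_entropy(log_vars))


# Toy usage: three ensemble members that disagree about a 2-D next state.
means = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
log_vars = np.full((3, 2), np.log(0.1))
print(js_disagreement(means, log_vars))
```

In an exploration loop in the spirit of Algorithm 1, this utility would be evaluated on imagined transitions from the current ensemble and used as the reward for the SAC-trained exploration policy mentioned in the Experiment Setup row.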
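
The Experiment Setup row states that each model is a probabilistic deep neural network trained with a negative log-likelihood loss to output a diagonal-Gaussian next-state distribution. Below is a minimal PyTorch sketch of one such ensemble member; the layer sizes, the log-variance clamp, and all names are illustrative assumptions, not the paper's architecture or hyper-parameters.

```python
# Sketch of a probabilistic dynamics model outputting a diagonal-Gaussian
# next-state distribution, trained with the Gaussian negative log-likelihood.
import torch
import torch.nn as nn


class GaussianDynamicsModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.log_var_head = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        # Clamp the log-variance for numerical stability (assumed range).
        log_var = self.log_var_head(h).clamp(-10.0, 4.0)
        return mean, log_var


def nll_loss(mean, log_var, next_state):
    """Negative log-likelihood of the observed next state under the predicted
    diagonal Gaussian (the constant 0.5*log(2*pi) term is dropped)."""
    inv_var = torch.exp(-log_var)
    return 0.5 * ((next_state - mean) ** 2 * inv_var + log_var).sum(-1).mean()


# Toy usage on random tensors.
model = GaussianDynamicsModel(state_dim=8, action_dim=2)
s, a, s_next = torch.randn(32, 8), torch.randn(32, 2), torch.randn(32, 8)
mean, log_var = model(s, a)
loss = nll_loss(mean, log_var, s_next)
loss.backward()
```

Training several such models on bootstrapped or independently shuffled replay data would yield the ensemble whose disagreement drives the utility sketched above.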