Latent State Marginalization as a Low-cost Approach for Improving Exploration
Authors: Dinghuai Zhang, Aaron Courville, Yoshua Bengio, Qinqing Zheng, Amy Zhang, Ricky T. Q. Chen
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. ... We evaluate SMAC on a series of diverse continuous control tasks from the DeepMind Control Suite (DMC; Tassa et al. (2018)). |
| Researcher Affiliation | Collaboration | Dinghuai Zhang, Aaron Courville, Yoshua Bengio (Mila, Université de Montréal); Qinqing Zheng, Amy Zhang, Ricky T. Q. Chen (Meta AI, FAIR) |
| Pseudocode | Yes | Algorithm 1 SMAC (without a world model) and Algorithm 2 SMAC (with a world model); a hedged sketch of the particle-based marginalization follows the table |
| Open Source Code | Yes | Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic. |
| Open Datasets | Yes | We evaluate SMAC on a series of diverse continuous control tasks from the DeepMind Control Suite (DMC; Tassa et al. (2018)). |
| Dataset Splits | No | The paper mentions a 'replay buffer D' for training and sampling states, but does not provide specific train/validation/test dataset splits, percentages, or counts for its experiments. |
| Hardware Specification | Yes | Tested with an NVIDIA Quadro GV100 on the pixel-based environments, our SMAC implementation does 60 frames per second (FPS) on average |
| Software Dependencies | No | The paper mentions using 'PyTorch' implementations for SAC and world models, but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We set the neural network width of the baselines and SMAC to 400 and 256 respectively to keep a comparable number of parameters. For the entropy coefficients, we use the same autotuning approach from SAC (Haarnoja et al., 2018b). ... instead we set its learning rate to 3 × 10⁻⁴, which is empirically much better and also consistent with two other algorithms. ... We choose the best hyperparameters (number of particles in {8, 16, 32}, dimension of the latent in {8, 16, 32}) for each environment. An illustrative sweep sketch follows the table. |
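
The Pseudocode row refers to Algorithms 1 and 2 (SMAC without and with a world model), whose core step is estimating the marginal policy density by averaging over latent particles. Below is a minimal PyTorch-style sketch of that particle-based marginalization; the `sample_latent` and `log_prob` interfaces are hypothetical stand-ins, not the API of the released repository.

```python
import math
import torch

def marginal_log_prob(policy, state, action, num_particles=16):
    """Monte Carlo estimate of log pi(a|s) for a latent-variable policy:
    log pi(a|s) ≈ logsumexp_k log pi(a|s, z_k) − log K, with the z_k drawn
    from the policy's latent distribution (K = num_particles).
    """
    # Hypothetical interface: draw K latent particles conditioned on the state.
    z = policy.sample_latent(state, num_particles)   # shape (K, latent_dim)
    # Hypothetical interface: per-particle conditional log-density of the action.
    log_probs = policy.log_prob(state, action, z)    # shape (K,)
    # Log-mean-exp over particles gives the marginal log-density estimate.
    return torch.logsumexp(log_probs, dim=0) - math.log(num_particles)
```

In a SAC-style objective, the negative of such an estimate would play the role of the marginal entropy term; the released code should be consulted for the exact estimator used in the paper.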
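The Experiment Setup row quotes the reported hyperparameters (network widths 400/256, SAC-style entropy autotuning, learning rate 3 × 10⁻⁴) and the per-environment grid over the number of particles and the latent dimension. The snippet below only restates those values as a config sweep; the dict layout and helper function are illustrative, not the authors' configuration files.

```python
from itertools import product

# Values quoted in the Experiment Setup row; names are illustrative.
BASE_CONFIG = {
    "baseline_hidden_width": 400,   # baselines use width 400
    "smac_hidden_width": 256,       # SMAC uses width 256 for comparable parameter counts
    "entropy_coef": "auto",         # SAC-style automatic entropy tuning
    "learning_rate": 3e-4,
}

# Per-environment grid search over particles and latent dimension.
PARTICLE_GRID = [8, 16, 32]
LATENT_DIM_GRID = [8, 16, 32]

def configs():
    """Yield one config per (num_particles, latent_dim) combination."""
    for n_particles, latent_dim in product(PARTICLE_GRID, LATENT_DIM_GRID):
        yield {**BASE_CONFIG, "num_particles": n_particles, "latent_dim": latent_dim}
```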