Latent World Models For Intrinsically Motivated Exploration
Authors: Aleksandr Ermolov, Nicu Sebe
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the method on image-based hard exploration environments from the Atari benchmark and report significant improvement with respect to prior work. |
| Researcher Affiliation | Academia | Aleksandr Ermolov, Nicu Sebe, Department of Information Engineering and Computer Science (DISI), University of Trento, Italy; {aleksandr.ermolov,niculae.sebe}@unitn.it |
| Pseudocode | No | The paper states 'The algorithm and the configuration are available in the Supplementary.' but does not contain pseudocode or an algorithm block in the main text. |
| Open Source Code | Yes | The source code of the method and all the experiments is available at https://github.com/htdt/lwm. |
| Open Datasets | Yes | We train the LWM method on 6 hard exploration Atari [3] environments: Freeway, Frostbite, Venture, Gravitar, Solaris and Montezuma's Revenge. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts. It describes evaluation procedures (e.g., 'average the cumulative reward over 128 different layouts', 'training budget is 50M environment frames') but gives no explicit splits such as an '80/10/10 split' or 'X samples for validation'. |
| Hardware Specification | Yes | One experiment requires 7.5h of a virtual machine with one Nvidia T4 GPU. |
| Software Dependencies | No | The paper mentions model components such as a GRU, RNN, CNN, and DQN, but does not specify the versions of any programming languages, frameworks, or libraries used. |
| Experiment Setup | Yes | We use 1 frame as a state instead of 4; we do not decouple actors and learner... we employ GRU as RNN... the model performs 40 burn-in steps... We use 0.999 momentum to update the running average. We clip the normalized value to range [-10, 10]... we multiply the resulting value with the coefficient β... intrinsic reward scaling β = 0.01 for Freeway and β = 1 for others. The training budget is 50M environment frames, the final scores are averaged over 128 episodes of an ϵ-greedy agent with ϵ = 0.001, and each experiment is performed with 5 different random seeds. (A minimal sketch of the intrinsic reward scaling appears after the table.) |
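
For reference, the reward post-processing described in the 'Experiment Setup' row (a running average updated with 0.999 momentum, clipping of the normalized value to [-10, 10], and scaling by β) can be summarized as a minimal sketch. This is not the authors' implementation: the class name `IntrinsicRewardScaler`, the use of the running average as a divisor, and the `1e-8` epsilon are illustrative assumptions; the authoritative code is in the linked repository (https://github.com/htdt/lwm).

```python
import numpy as np


class IntrinsicRewardScaler:
    """Sketch of the intrinsic reward post-processing quoted above:
    maintain a running average with momentum 0.999, clip the normalized
    reward to [-10, 10], then scale by beta. Names and the exact
    normalization are assumptions, not the paper's implementation."""

    def __init__(self, beta: float = 1.0, momentum: float = 0.999, clip: float = 10.0):
        self.beta = beta          # beta = 0.01 for Freeway, 1.0 for the other games
        self.momentum = momentum  # momentum of the running average
        self.clip = clip          # clipping range for the normalized reward
        self.running_avg = None   # running average of the raw intrinsic reward

    def __call__(self, raw_reward: float) -> float:
        # Update the running average of the raw intrinsic reward.
        if self.running_avg is None:
            self.running_avg = raw_reward
        else:
            self.running_avg = (self.momentum * self.running_avg
                                + (1.0 - self.momentum) * raw_reward)
        # Normalize by the running average (assumption), clip, and scale by beta.
        normalized = raw_reward / (abs(self.running_avg) + 1e-8)
        clipped = float(np.clip(normalized, -self.clip, self.clip))
        return self.beta * clipped
```

As a usage example under these assumptions, `scaler = IntrinsicRewardScaler(beta=0.01)` (the Freeway setting) would be applied to each raw intrinsic reward before it is combined with the extrinsic reward.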