Latent World Models For Intrinsically Motivated Exploration

Authors: Aleksandr Ermolov, Nicu Sebe

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the method on image-based hard exploration environments from the Atari benchmark and report significant improvement with respect to prior work.
Researcher Affiliation | Academia | Aleksandr Ermolov, Nicu Sebe, Department of Information Engineering and Computer Science (DISI), University of Trento, Italy, {aleksandr.ermolov,niculae.sebe}@unitn.it
Pseudocode | No | The paper states 'The algorithm and the configuration are available in the Supplementary.' but does not contain pseudocode or an algorithm block in the main text.
Open Source Code | Yes | The source code of the method and all the experiments is available at https://github.com/htdt/lwm.
Open Datasets | Yes | We train the LWM method on 6 hard exploration Atari [3] environments: Freeway, Frostbite, Venture, Gravitar, Solaris and Montezuma's Revenge.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits with percentages or sample counts. It describes evaluation procedures (e.g., 'average the cumulative reward over 128 different layouts', 'training budget is 50M environment frames') but not explicit splits for reproducibility like '80/10/10 split' or 'X samples for validation'.
Hardware Specification | Yes | One experiment requires 7.5h of a virtual machine with one Nvidia T4 GPU.
Software Dependencies | No | The paper mentions software components like GRU, RNN, CNN, and DQN, but does not specify their version numbers or the versions of any programming languages or libraries used.
Experiment Setup | Yes | We use 1 frame as a state instead of 4; we do not decouple actors and learner... we employ GRU as RNN... the model performs 40 burn-in steps... We use 0.999 momentum to update the running average. We clip the normalized value to range [-10, 10]... we multiply the resulting value with the coefficient β... intrinsic reward scaling β = 0.01 for Freeway and β = 1 for others. The training budget is 50M environment frames, the final scores averaged over 128 episodes of an ϵ-greedy agent with ϵ = 0.001, each experiment is performed with 5 different random seeds.
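
The Open Datasets row lists six hard-exploration Atari games. A minimal sketch of how these environments could be instantiated with OpenAI Gym is shown below; the `NoFrameskip-v4` IDs and the absence of preprocessing wrappers are assumptions, since the authors' actual setup is defined in their repository (https://github.com/htdt/lwm).

```python
import gym  # requires the Atari extras, e.g. `pip install gym[atari]`

# The six hard-exploration games named in the Open Datasets row.
# The "NoFrameskip-v4" IDs and the lack of wrappers are assumptions;
# the authors' actual preprocessing is defined in their repository.
GAMES = ["Freeway", "Frostbite", "Venture",
         "Gravitar", "Solaris", "MontezumaRevenge"]

envs = {name: gym.make(f"{name}NoFrameskip-v4") for name in GAMES}

for name, env in envs.items():
    print(name, env.observation_space, env.action_space)
```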
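
The Experiment Setup row describes the intrinsic-reward post-processing: a running average updated with momentum 0.999, normalization, clipping to [-10, 10], and scaling by the coefficient β (0.01 for Freeway, 1 for the other games). The sketch below illustrates that pipeline; the exact statistic tracked by the running average (a running mean of squared rewards here) and the small constant added for numerical stability are assumptions, not details confirmed by the excerpt.

```python
import numpy as np

class IntrinsicRewardScaler:
    """Sketch of the intrinsic-reward post-processing quoted in the table:
    running average with momentum 0.999, normalization, clipping to
    [-10, 10], and scaling by beta. The tracked statistic (running mean
    of squared rewards) is an assumption."""

    def __init__(self, beta: float, momentum: float = 0.999):
        self.beta = beta
        self.momentum = momentum
        self.running_sq = 1.0  # running mean of squared rewards (assumed)

    def __call__(self, raw_reward: float) -> float:
        # Update the running average with the configured momentum.
        self.running_sq = (self.momentum * self.running_sq
                           + (1.0 - self.momentum) * raw_reward ** 2)
        # Normalize, clip to [-10, 10], then scale by beta.
        normalized = raw_reward / (np.sqrt(self.running_sq) + 1e-8)
        clipped = float(np.clip(normalized, -10.0, 10.0))
        return self.beta * clipped

# Example: beta = 0.01 for Freeway, beta = 1.0 for the other games.
scaler = IntrinsicRewardScaler(beta=0.01)
print(scaler(0.5))
```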