Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning

Authors: Harry Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods. ... We show through detailed controlled experiments that the proposed framework, named Skipper, in most cases performs significantly better in terms of zero-shot generalization, compared to the baselines and to some state-of-the-art Hierarchical Planning (HP) methods...
Researcher Affiliation | Collaboration | 1McGill University, 2Université de Montréal, 3Mila, 4Sony AI, 5Google DeepMind
Pseudocode | Yes | Algorithm 1: Skipper with Random Checkpoints (implementation choice in purple)... Algorithm 2: Checkpoint Pruning with k-medoids... Algorithm 3: Delusion Suppression. (A hedged sketch of k-medoids checkpoint pruning appears after the table.)
Open Source Code | Yes | Source code of experiments available at https://github.com/mila-iqia/Skipper. ... The results presented in the experiments are fully-reproducible with the open-sourced repository https://github.com/mila-iqia/Skipper.
Open Datasets | Yes | Thus, we base our experimental setting on the MiniGrid-BabyAI framework (Chevalier-Boisvert et al., 2018b;a; Hui et al., 2020). ... Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. GitHub repository, 2018b. https://github.com/maximecb/gym-minigrid. (A minimal environment-instantiation sketch appears after the table.)
Dataset Splits | No | The paper states, 'Across all experiments, we sample training tasks from an environment distribution of difficulty 0.4... The evaluation tasks are sampled from a gradient of OOD difficulties 0.25, 0.35, 0.45 and 0.55', detailing training and evaluation sets. However, it does not explicitly mention a distinct validation data split for hyperparameter tuning or model selection. (A configuration sketch of these splits appears after the table.)
Hardware Specification | No | The paper mentions 'computational resources allocated to us from Mila and McGill University' but does not specify any particular hardware components like CPU or GPU models, or memory details.
Software Dependencies | No | The paper refers to various algorithms and models such as 'Adam', 'C51 distributional TD learning', and 'VQ-VAE', citing their original papers. It also mentions basing experiments on publicly available code (e.g., for Director), but it does not specify concrete version numbers for software libraries like PyTorch or TensorFlow.
Experiment Setup | Yes | All the trainable parameters are optimized with Adam at a rate of 2.5 × 10⁻⁴ (Kingma & Ba, 2014), with gradient clipping by value (maximum absolute value 1.0). ... The internal γ for the intrinsic reward of π is 0.95, while the task γ is 0.99. ... The estimators, which operate on the partial states, are 3-layered MLPs with 256 hidden units. ... The whole checkpoint generator is trained end-to-end with a standard VAE loss, that is, the sum of a KL-divergence for the agent's location and the entropy of partial descriptions, weighted by 2.5 × 10⁻⁴... (A hedged sketch of this optimization setup appears after the table.)
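
The pseudocode row lists Algorithm 2, "Checkpoint Pruning with k-medoids". The sketch below is a rough illustration of that idea, not the authors' implementation: it selects a diverse subset of candidate checkpoints by running k-medoids on a pairwise-distance matrix. The Euclidean placeholder distance, feature dimensions, and function names are assumptions; Skipper itself would use its learned distance estimates between generated checkpoints.

```python
# Hedged sketch of checkpoint pruning via k-medoids (cf. Algorithm 2 in the paper).
# The distance matrix here is a placeholder (Euclidean over hypothetical checkpoint
# features); the paper's agent would use its learned distance/connectivity estimates.
import numpy as np

def k_medoids(dist: np.ndarray, k: int, n_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Return indices of k medoids given a symmetric pairwise-distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest current medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # The new medoid minimizes total distance to its cluster members.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

def prune_checkpoints(checkpoints: np.ndarray, k: int) -> np.ndarray:
    """Keep a diverse subset of k checkpoints (rows) by clustering and keeping the medoids."""
    dist = np.linalg.norm(checkpoints[:, None, :] - checkpoints[None, :, :], axis=-1)
    return checkpoints[k_medoids(dist, k)]

# Example: prune 32 candidate checkpoints (8-dim features) down to 4.
candidates = np.random.default_rng(1).normal(size=(32, 8))
pruned = prune_checkpoints(candidates, k=4)
print(pruned.shape)  # (4, 8)
```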
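
The open-datasets row cites the gym-minigrid repository underlying the MiniGrid-BabyAI framework. Below is a minimal sketch of instantiating one of its stock environments; it assumes the legacy gym/gym_minigrid API from the cited repository and does not reproduce the specific procedurally generated task distribution used in the paper.

```python
# Hedged sketch: creating a MiniGrid environment with the cited gym-minigrid package.
# Assumes the legacy gym (<0.26) step/reset API used by that repository.
import gym
import gym_minigrid  # importing registers the MiniGrid-* environment ids with gym

env = gym.make("MiniGrid-Empty-8x8-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```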
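
The dataset-splits row quotes a training difficulty of 0.4 and evaluation difficulties of 0.25, 0.35, 0.45 and 0.55, with no separate validation split described. A hedged configuration sketch follows; `sample_task` and the task counts are hypothetical stand-ins for the paper's task generator and budgets.

```python
# Hedged sketch of the train/evaluation split described in the paper: training tasks
# at difficulty 0.4, evaluation tasks over a gradient of OOD difficulties.
TRAIN_DIFFICULTY = 0.4
EVAL_DIFFICULTIES = [0.25, 0.35, 0.45, 0.55]  # no distinct validation split is described

def make_splits(sample_task, n_train=50, n_eval_per_level=20, seed=0):
    """Build train/eval task sets; `sample_task(difficulty, seed)` is a hypothetical generator."""
    train_tasks = [sample_task(TRAIN_DIFFICULTY, seed + i) for i in range(n_train)]
    eval_tasks = {
        d: [sample_task(d, seed + 10_000 + i) for i in range(n_eval_per_level)]
        for d in EVAL_DIFFICULTIES
    }
    return train_tasks, eval_tasks
```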
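
The experiment-setup row quotes Adam at 2.5 × 10⁻⁴, gradient clipping by value (maximum absolute value 1.0), and 3-layered MLP estimators with 256 hidden units. The PyTorch sketch below wires those quoted hyperparameters together under assumed input/output sizes; it is an illustration, not the authors' training loop.

```python
# Hedged PyTorch sketch of the quoted optimization setup: Adam at 2.5e-4 with
# gradient clipping by value (max abs 1.0), and a 3-layer MLP with 256 hidden units.
# obs_dim and n_out are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

obs_dim, n_out = 64, 51  # hypothetical input size / output size

estimator = nn.Sequential(          # "3-layered MLPs with 256 hidden units"
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n_out),
)
optimizer = torch.optim.Adam(estimator.parameters(), lr=2.5e-4)

def optimize_step(loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping by value, maximum absolute value 1.0.
    torch.nn.utils.clip_grad_value_(estimator.parameters(), clip_value=1.0)
    optimizer.step()
```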