Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning
Authors: Harry Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods. ... We show through detailed controlled experiments that the proposed framework, named Skipper, in most cases performs significantly better in terms of zero-shot generalization, compared to the baselines and to some state-of-the-art Hierarchical Planning (HP) methods... |
| Researcher Affiliation | Collaboration | 1McGill University, 2Université de Montréal, 3Mila, 4Sony AI, 5Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Skipper with Random Checkpoints (implementation choice in purple)... Algorithm 2: Checkpoint Pruning with k-medoids... Algorithm 3: Delusion Suppression |
| Open Source Code | Yes | Source code of experiments available at https://github.com/mila-iqia/Skipper. ... The results presented in the experiments are fully-reproducible with the open-sourced repository https://github.com/mila-iqia/Skipper. |
| Open Datasets | Yes | Thus, we base our experimental setting on the MiniGrid-BabyAI framework (Chevalier-Boisvert et al., 2018b;a; Hui et al., 2020). ... Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. GitHub repository, 2018b. https://github.com/maximecb/gym-minigrid. |
| Dataset Splits | No | The paper states, 'Across all experiments, we sample training tasks from an environment distribution of difficulty 0.4... The evaluation tasks are sampled from a gradient of OOD difficulties 0.25, 0.35, 0.45 and 0.55', detailing training and evaluation sets. However, it does not explicitly mention a distinct validation data split for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper mentions 'computational resources allocated to us from Mila and McGill University' but does not specify any particular hardware components like CPU or GPU models, or memory details. |
| Software Dependencies | No | The paper refers to various algorithms and models such as 'Adam', 'C51 distributional TD learning', and 'VQ-VAE', citing their original papers. It also mentions basing experiments on publicly available code (e.g., for Director), but it does not specify concrete version numbers for software libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | All the trainable parameters are optimized with Adam at a rate of 2.5 × 10⁻⁴ (Kingma & Ba, 2014), with gradient clipping by value (maximum absolute value 1.0). ... The internal γ for intrinsic reward of π is 0.95, while the task γ is 0.99. ... The estimators, which operate on the partial states, are 3-layered MLPs with 256 hidden units. ... The whole checkpoint generator is trained end-to-end with a standard VAE loss. That is, the sum of a KL-divergence for the agent's location, and the entropy of partial descriptions, weighted by 2.5 × 10⁻⁴... |
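
The Pseudocode row above names "Checkpoint Pruning with k-medoids" (Algorithm 2 in the paper). The snippet below is a minimal, hypothetical sketch of that idea only: it is not the authors' implementation, and plain Euclidean distance between checkpoint embeddings stands in for whatever distance estimate Skipper actually uses.

```python
# Hypothetical sketch of k-medoids-style checkpoint pruning (not the authors' code):
# reduce N candidate checkpoints to k representatives that cover the candidate set.
import numpy as np

def kmedoids_prune(embeddings: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Return indices of k medoid checkpoints chosen from `embeddings` (N x D)."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    # Pairwise distances between candidate checkpoints (Euclidean as a placeholder).
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every candidate to its nearest medoid.
        assign = np.argmin(dists[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members) == 0:
                continue
            # The new medoid is the member minimizing total distance to its cluster.
            within = dists[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

# Usage: keep 8 representative checkpoints out of 64 candidates.
candidates = np.random.default_rng(1).normal(size=(64, 16))
print(kmedoids_prune(candidates, k=8))
```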
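
The Dataset Splits row reports training tasks drawn at difficulty 0.4 and zero-shot evaluation at a gradient of OOD difficulties; one compact way to record that setup is a small config dictionary. The structure below is purely illustrative; only the difficulty values come from the paper.

```python
# Illustrative record of the reported train/eval task distributions (structure assumed).
task_distributions = {
    "train": {"difficulty": 0.4},                         # training tasks
    "eval": {"difficulties": [0.25, 0.35, 0.45, 0.55]},   # zero-shot OOD evaluation
}
```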
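
For the Experiment Setup row, the reported optimization choices (Adam at 2.5 × 10⁻⁴, gradient clipping by value at 1.0, 3-layer MLP estimators with 256 hidden units) translate into a short PyTorch sketch. The input/output dimensions and the placeholder loss are assumptions for illustration, not values taken from the repository.

```python
# Minimal sketch of the reported optimization settings; module shapes and the
# dummy regression loss are placeholders, not the authors' code.
import torch
import torch.nn as nn

estimator = nn.Sequential(            # 3-layer MLP with 256 hidden units (input dim assumed)
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.Adam(estimator.parameters(), lr=2.5e-4)

def training_step(batch_inputs: torch.Tensor, batch_targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(estimator(batch_inputs), batch_targets)
    loss.backward()
    # Clip gradients by value (maximum absolute value 1.0), as reported in the paper.
    torch.nn.utils.clip_grad_value_(estimator.parameters(), clip_value=1.0)
    optimizer.step()
    return loss.item()
```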