Combining Behaviors with the Successor Features Keyboard

Authors: Wilka Carvalho, Andre Saraiva, Angelos Filos, Andrew Lampinen, Loic Matthey, Richard L. Lewis, Honglak Lee, Satinder Singh, Danilo Jimenez Rezende, Daniel Zoran

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We first compare CSFA against other methods for approximating SFs and show that only CSFA discovers representations compatible with SF&GPI at this scale. We then compare SFK against transfer learning baselines and show that it transfers most quickly to long-horizon tasks.
Researcher Affiliation | Collaboration | Wilka Carvalho (1), Andre Saraiva (2), Angelos Filos (2), Andrew Kyle Lampinen (2), Loic Matthey (2), Richard L. Lewis (3), Honglak Lee (3), Satinder Singh (2,3), Danilo J. Rezende (2), Daniel Zoran (2). Affiliations: (1) Harvard University, (2) Google DeepMind, (3) University of Michigan.
Pseudocode | No | The paper describes its methods with text and diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The code is proprietary, so we cannot release it, but we included implementation details below to enable reproducibility.
Open Datasets | Yes | We study SFK and CSFA in Playroom [10], a challenging 3D environment with high-dimensional pixel observations and long-horizon tasks defined by sparse rewards.
Dataset Splits | No | The paper describes training and transfer tasks but does not specify train/validation/test splits with percentages, sample counts, or references to predefined partitions.
Hardware Specification | Yes | All experiments were run using Google Dragonfish TPUs with a 2×2 topology. Experiments for the training tasks (§5.1) lasted about 1.5 days, the simplified experiments in §C.1 lasted 5-6 hours, and the transfer experiments (§5.2) lasted 2-3 hours. We ran a learner, actors, and evaluators on TPUs.
Software Dependencies | No | Everything was implemented with the JAX ecosystem [40]: architectures with the haiku library, optimizers with the optax library, and RL algorithms with the rlax library. Although the software components are listed, specific version numbers for these libraries are not provided, which are needed for full reproducibility. (A hedged sketch of how this stack composes appears after the table.)
Experiment Setup | Yes | Following Mnih et al. [26], all architectures had a replay buffer that held up to 100,000 trajectories of length 30. [...] All methods used a learning rate of 3e-4 and set the max gradient norm to 0.5. We used the following coefficients to balance loss terms: βQ = 1e3, βϕ = 1e4. Since CSFA uses a categorical loss L^cat_ψ to learn SFs, which has a different magnitude from the scalar loss used to learn Q-values and ϕ, we found that we needed to make this coefficient smaller, βψ = 8e-3, whereas we used βψ = 5e2 for scalar-based methods. (A hedged sketch of this optimizer and loss-weighting configuration appears after the table.)
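To make the dependency list concrete, below is a minimal, hypothetical sketch of how the named libraries typically compose: a haiku-transformed toy network and a single rlax TD-error call. The network shape, action count, observation size, and all values are illustrative assumptions, not the paper's actual architecture.

```python
import haiku as hk
import jax
import jax.numpy as jnp
import rlax

# Toy MLP Q-network; a stand-in for the paper's pixel-based architecture.
def q_fn(obs):
    return hk.nets.MLP([64, 4])(obs)  # 4 actions (hypothetical)

network = hk.without_apply_rng(hk.transform(q_fn))
obs = jnp.zeros((1, 8))  # dummy 8-dim observation (hypothetical)
params = network.init(jax.random.PRNGKey(0), obs)
q_values = network.apply(params, obs)[0]

# One-step Q-learning TD error from rlax (single transition; use
# jax.vmap to apply it over a batch).
td_error = rlax.q_learning(
    q_tm1=q_values,             # Q(s_{t-1}, .)
    a_tm1=jnp.int32(0),         # action taken
    r_t=jnp.float32(1.0),       # reward
    discount_t=jnp.float32(0.99),
    q_t=q_values,               # Q(s_t, .) for bootstrapping
)
```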
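Similarly, here is a minimal sketch of the quoted optimizer and loss-weighting configuration, assuming optax's standard clip-then-Adam chaining. The per-term losses (loss_q, loss_phi, loss_psi) are placeholder quadratics standing in for the paper's Q-value, cumulant, and successor-feature objectives; only the coefficients, learning rate, and gradient-norm cap come from the quoted setup.

```python
import jax
import jax.numpy as jnp
import optax

# Coefficients quoted in the setup; beta_psi = 8e-3 is the CSFA
# (categorical-loss) value, 5e2 the scalar-based one.
BETA_Q, BETA_PHI, BETA_PSI = 1e3, 1e4, 8e-3

# Placeholder quadratic losses, not the paper's objectives.
def loss_q(params, batch):
    return jnp.mean((batch @ params["w_q"]) ** 2)

def loss_phi(params, batch):
    return jnp.mean((batch @ params["w_phi"]) ** 2)

def loss_psi(params, batch):
    return jnp.mean((batch @ params["w_psi"]) ** 2)

def total_loss(params, batch):
    # Weighted sum balancing the three loss terms.
    return (BETA_Q * loss_q(params, batch)
            + BETA_PHI * loss_phi(params, batch)
            + BETA_PSI * loss_psi(params, batch))

# Quoted hyperparameters: max gradient norm 0.5, learning rate 3e-4.
optimizer = optax.chain(optax.clip_by_global_norm(0.5), optax.adam(3e-4))

@jax.jit
def update(params, opt_state, batch):
    grads = jax.grad(total_loss)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state

params = {k: jnp.ones((8, 1)) for k in ("w_q", "w_phi", "w_psi")}
opt_state = optimizer.init(params)
params, opt_state = update(params, opt_state, jnp.ones((32, 8)))
```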