A First-Occupancy Representation for Reinforcement Learning

Authors: Ted Moskovitz, Spencer R Wilson, Maneesh Sahani

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We now demonstrate the broad applicability of the FR, and highlight ways its properties differ from those of the SR. We focus on 4 areas: exploration, unsupervised RL, planning, and animal behavior. We tested our approach on the RIVERSWIM and SIXARMS problems (Strehl & Littman, 2008), two hard-exploration tasks from the PAC-MDP literature. The results are listed in Table 1. |
| Researcher Affiliation | Academia | 1 Gatsby Unit, UCL; 2 Sainsbury Wellcome Centre, UCL |
| Pseudocode | Yes | Algorithm 1: FR Planning (FRP) and Algorithm 2: Construct Plan are provided in Appendix A.2. |
| Open Source Code | Yes | We have attached code for the tabular experiments (also available at github.com/tedmoskovitz/first_occupancy). |
| Open Datasets | Yes | We tested our approach on the RIVERSWIM and SIXARMS problems (Strehl & Littman, 2008), the continuous Mountain Car task (Brockman et al., 2016), the DEEPSEA task (Osband et al., 2020), the 6-DoF JACO robotic arm environment from Laskin et al. (2021), and the FOURROOMS environment (Sutton et al., 1999). |
| Dataset Splits | No | The paper describes training procedures in terms of episodes and time steps (e.g., "pre-trains for 20,000 time steps", "trained for 1M time steps"), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) typically found in supervised learning setups. |
| Hardware Specification | Yes | All experiments except for the robotic reaching experiment were performed on a single 8-core CPU. The robotic reaching experiment was performed using four Nvidia Quadro RTX 5000 GPUs. |
| Software Dependencies | No | The paper mentions software like JAX (Bradbury et al., 2018) and the Adam optimizer (Kingma & Ba, 2017), and refers to base code from Laskin et al. (2021). However, it does not provide specific version numbers for these software components, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | Table 3 lists hyperparameter settings for the DEEPSEA experiment, including: optimizer Adam; learning rate 0.001; β 0.05; wQ, ws, wX (1, 100, 1000); B 32; replay buffer size 10,000; target update period 4; γ 0.99; ε 0.05. |
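
For convenience, the DEEPSEA settings quoted in the Experiment Setup row can be gathered into a single configuration mapping. This is only a restatement of the reported Table 3 values; the key names (and the reading of B as a batch size) are illustrative assumptions, not identifiers from the authors' code.

```python
# Hypothetical restatement of the DEEPSEA hyperparameters reported in the
# paper's Table 3. Key names are illustrative and do not come from the
# authors' released code.
deepsea_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "beta": 0.05,                          # β as reported
    "loss_weights": (1.0, 100.0, 1000.0),  # (wQ, ws, wX) as reported
    "B": 32,                               # reported as B (presumably the batch size)
    "replay_buffer_size": 10_000,
    "target_update_period": 4,
    "gamma": 0.99,                         # discount factor γ
    "epsilon": 0.05,                       # ε as reported
}
```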
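Because the rows above contrast the first-occupancy representation (FR) with the successor representation (SR) and point to the FR Planning pseudocode in Appendix A.2, a minimal tabular sketch may help orient readers. It assumes the FR recursion F(s, s') = 1[s = s'] + (1 − 1[s = s'])·γ·E[F(s'', s')], i.e., discounted occupancy that stops accumulating once s' is first reached; the function and variable names are hypothetical, and this is a sketch rather than the authors' released implementation.

```python
import numpy as np

def fr_td_update(F, s, s_next, gamma=0.99, alpha=0.1):
    """One tabular TD-style update of the first-occupancy representation (FR).

    F is an (S, S) matrix where F[s, s'] estimates the discounted first
    occupancy of s' when starting from s under the current policy. Unlike the
    successor representation, accumulation toward s' stops once s' has been
    reached, which the (1 - indicator) factor below implements.
    Sketch only: the update form is an assumption based on the FR recursion,
    not the authors' code.
    """
    num_states = F.shape[0]
    indicator = np.zeros(num_states)
    indicator[s] = 1.0  # 1[s = s'] evaluated for every target state s'
    target = indicator + (1.0 - indicator) * gamma * F[s_next]
    F[s] += alpha * (target - F[s])
    return F

# Tiny usage example: one transition s=1 -> s_next=2 on a 5-state chain.
F = np.zeros((5, 5))
F = fr_td_update(F, s=1, s_next=2)
```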