A First-Occupancy Representation for Reinforcement Learning

Authors: Ted Moskovitz, Spencer R Wilson, Maneesh Sahani

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We now demonstrate the broad applicability of the FR, and highlight ways its properties differ from those of the SR. We focus on 4 areas: exploration, unsupervised RL, planning, and animal behavior. We tested our approach on the RIVERSWIM and SIXARMS problems (Strehl & Littman, 2008), two hard-exploration tasks from the PAC-MDP literature. The results are listed in Table 1. |
| Researcher Affiliation | Academia | 1 Gatsby Unit, UCL; 2 Sainsbury Wellcome Centre, UCL |
| Pseudocode | Yes | Algorithm 1: FR Planning (FRP) and Algorithm 2: Construct Plan are provided in Appendix A.2. |
| Open Source Code | Yes | We have attached code for the tabular experiments (also available at github.com/tedmoskovitz/first_occupancy). |
| Open Datasets | Yes | We tested our approach on the RIVERSWIM and SIXARMS problems (Strehl & Littman, 2008), the continuous Mountain Car task (Brockman et al., 2016), the DEEPSEA task (Osband et al., 2020), the 6-DoF JACO robotic arm environment from Laskin et al. (2021), and the FOURROOMS environment (Sutton et al., 1999). |
| Dataset Splits | No | The paper describes training procedures in terms of episodes and time steps (e.g., "pre-trains for 20,000 time steps", "trained for 1M time steps"), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) typically found in supervised learning setups. |
| Hardware Specification | Yes | All experiments except for the robotic reaching experiment were performed on a single 8-core CPU. The robotic reaching experiment was performed using four Nvidia Quadro RTX 5000 GPUs. |
| Software Dependencies | No | The paper mentions software like JAX (Bradbury et al., 2018) and the Adam optimizer (Kingma & Ba, 2017), and refers to base code from Laskin et al. (2021). However, it does not provide specific version numbers for these software components, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | Table 3 lists hyperparameter settings for the DEEPSEA experiment, including: optimizer Adam; learning rate 0.001; β 0.05; wQ, ws, wX (1, 100, 1000); B 32; replay buffer size 10,000; target update period 4; γ 0.99; ε 0.05. |
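
For convenience, the DEEPSEA settings quoted in the Experiment Setup row can be gathered into a single configuration mapping. This is only a restatement of the reported Table 3 values; the key names (and the reading of B as a batch size) are illustrative assumptions, not identifiers from the authors' code.

```python
# Hypothetical restatement of the DEEPSEA hyperparameters reported in the
# paper's Table 3. Key names are illustrative and do not come from the
# authors' released code.
deepsea_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "beta": 0.05,                          # β as reported
    "loss_weights": (1.0, 100.0, 1000.0),  # (wQ, ws, wX) as reported
    "B": 32,                               # reported as B (presumably the batch size)
    "replay_buffer_size": 10_000,
    "target_update_period": 4,
    "gamma": 0.99,                         # discount factor γ
    "epsilon": 0.05,                       # ε as reported
}
```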
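Because the rows above contrast the first-occupancy representation (FR) with the successor representation (SR) and point to the FR Planning pseudocode in Appendix A.2, a minimal tabular sketch may help orient readers. It assumes the FR recursion F(s, s') = 1[s = s'] + (1 − 1[s = s'])·γ·E[F(s'', s')], i.e., discounted occupancy that stops accumulating once s' is first reached; the function and variable names are hypothetical, and this is a sketch rather than the authors' released implementation.

```python
import numpy as np

def fr_td_update(F, s, s_next, gamma=0.99, alpha=0.1):
    """One tabular TD-style update of the first-occupancy representation (FR).

    F is an (S, S) matrix where F[s, s'] estimates the discounted first
    occupancy of s' when starting from s under the current policy. Unlike the
    successor representation, accumulation toward s' stops once s' has been
    reached, which the (1 - indicator) factor below implements.
    Sketch only: the update form is an assumption based on the FR recursion,
    not the authors' code.
    """
    num_states = F.shape[0]
    indicator = np.zeros(num_states)
    indicator[s] = 1.0  # 1[s = s'] evaluated for every target state s'
    target = indicator + (1.0 - indicator) * gamma * F[s_next]
    F[s] += alpha * (target - F[s])
    return F

# Tiny usage example: one transition s=1 -> s_next=2 on a 5-state chain.
F = np.zeros((5, 5))
F = fr_td_update(F, s=1, s_next=2)
```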