A First-Occupancy Representation for Reinforcement Learning
Authors: Ted Moskovitz, Spencer R Wilson, Maneesh Sahani
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now demonstrate the broad applicability of the FR, and highlight ways its properties differ from those of the SR. We focus on 4 areas: exploration, unsupervised RL, planning, and animal behavior. We tested our approach on the RIVERSWIM and SIXARMS problems (Strehl & Littman, 2008), two hard-exploration tasks from the PAC-MDP literature. The results are listed in Table 1. |
| Researcher Affiliation | Academia | Gatsby Unit, UCL; Sainsbury Wellcome Centre, UCL |
| Pseudocode | Yes | Algorithm 1: FR Planning (FRP) and Algorithm 2: Construct Plan are provided in Appendix A.2 (a minimal sketch of the underlying FR update follows this table). |
| Open Source Code | Yes | We have attached code for the tabular experiments (also available at github.com/tedmoskovitz/first_occupancy) |
| Open Datasets | Yes | We tested our approach on the RIVERSWIM and SIXARMS problems (Strehl & Littman, 2008), Continuous Mountain Car task (Brockman et al., 2016), DEEPSEA task (Osband et al., 2020), 6-DoF JACO robotic arm environment from Laskin et al. (2021), and the FOURROOMS environment (Sutton et al., 1999). |
| Dataset Splits | No | The paper describes training procedures in terms of episodes and time steps (e.g., "pre-trains for 20,000 time steps", "trained for 1M time steps"), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) typically found in supervised learning setups. |
| Hardware Specification | Yes | All experiments except for the robotic reaching experiment were performed on a single 8-core CPU. The robotic reaching experiment was performed using four Nvidia Quadro RTX 5000 GPUs. |
| Software Dependencies | No | The paper mentions software like JAX (Bradbury et al., 2018) and the Adam optimizer (Kingma & Ba, 2017) and refers to base code from Laskin et al. (2021). However, it does not provide specific version numbers for these software components, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | Table 3 lists hyperparameter settings for the DEEPSEA experiment: optimizer Adam; learning rate 0.001; β 0.05; weights (wQ, ws, wX) = (1, 100, 1000); B 32; replay buffer size 10,000; target update period 4; γ 0.99; ϵ 0.05 (restated as a config sketch below). |
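
For readers who want the mechanics behind Algorithm 1, the tabular first-occupancy update the paper builds on is easy to state in code. The sketch below is our NumPy rendering of the FR TD update (a one-hot target at the current state, with the bootstrap gated off there so occupancy after the first visit contributes nothing); the function name and toy environment are ours, not the authors' released code.

```python
import numpy as np

def fr_td_update(F, s, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD update for the first-occupancy representation (FR).

    F is an (S, S) array; F[s, s'] estimates the expected discount applied
    at the *first* visit to s' when starting from s. The (1 - one_hot)
    factor zeroes the bootstrap at s' itself, which is what distinguishes
    the FR from the successor representation.
    """
    one_hot = np.zeros(F.shape[1])
    one_hot[s] = 1.0
    target = one_hot + gamma * (1.0 - one_hot) * F[s_next]
    F[s] += alpha * (target - F[s])
    return F

# Illustrative only: learn the FR of a random walk on a 5-state ring.
rng = np.random.default_rng(0)
F = np.zeros((5, 5))
s = 0
for _ in range(10_000):
    s_next = (s + rng.choice([-1, 1])) % 5
    F = fr_td_update(F, s, s_next)
    s = s_next
```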
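The DEEPSEA settings from Table 3 also translate directly into a configuration block. The sketch below mirrors those values; the schema and field names are hypothetical (the paper does not publish a config format), and the comments mark our reading of each symbol.

```python
# Hypothetical config mirroring Table 3 (DEEPSEA); field names are ours.
deepsea_config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "beta": 0.05,                        # β in Table 3
    "w_Q": 1, "w_s": 100, "w_X": 1000,   # weights wQ, ws, wX
    "batch_size": 32,                    # B in Table 3 (our reading)
    "replay_buffer_size": 10_000,
    "target_update_period": 4,
    "gamma": 0.99,                       # discount γ
    "epsilon": 0.05,                     # ϵ (our reading: ϵ-greedy)
}
```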