Learning Value Functions from Undirected State-only Experience
Authors: Matthew Chang, Arjun Gupta, Saurabh Gupta
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods. |
| Researcher Affiliation | Academia | Matthew Chang, Arjun Gupta, Saurabh Gupta; University of Illinois at Urbana-Champaign; {mc48, arjung2, saurabhg}@illinois.edu |
| Pseudocode | Yes | Section A.8 ALGORITHM contains 'Algorithm 1 LAQ'. |
| Open Source Code | No | The paper provides a 'Project website: https://matthewchang.github.io/latent_action_qlearning_site/', but this is a project overview page rather than a direct link to a source-code repository. |
| Open Datasets | Yes | We use the Maze2D data from D4RL (Fu et al., 2020). |
| Dataset Splits | No | The paper mentions training and evaluating models but does not provide specific percentages or counts for training, validation, or test splits. It implies some form of validation for model selection in Section A.11: 'The numbers reported in Table 1 are the 95th percentile Spearman's correlation coefficients over the course of training. If training is stable and converges, this corresponds to taking the final value, and in the case that training is not stable and diverges, this acts as a form of early stopping.' (A minimal sketch of this selection rule follows the table.) |
| Hardware Specification | No | The paper states 'We use the ACME codebase (Hoffman et al., 2020) for experiments.' but does not provide any specific hardware details such as GPU models, CPU types, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using 'the ACME codebase (Hoffman et al., 2020)' but does not provide specific version numbers for ACME or any other software dependencies. |
| Experiment Setup | Yes | In all settings we do Q-learning over the top 8 dominant actions, except for Freeway, where using the top three actions stabilized training. We use a multi-layer perceptron for fθ and L2 loss for l. We scale up the sparse task rewards by a factor of 5 so that behavior is dominated by the task reward once the policy starts solving the task. Results are averaged over 5 seeds and show standard error. (A sketch of this Q-learning target follows the table.) |
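
The model-selection rule quoted in the Dataset Splits row (taking the 95th percentile of the Spearman correlation over the course of training) is straightforward to restate in code. The sketch below is an illustration only, not the authors' implementation; the function name `percentile_score` and the synthetic checkpoint loop are assumptions made for the sake of a runnable example.

```python
import numpy as np
from scipy.stats import spearmanr

def percentile_score(correlation_history, q=95):
    """95th-percentile Spearman correlation over training: close to the final
    value when training converges, and a form of early stopping when training
    later diverges (illustrative helper, not the paper's code)."""
    return float(np.percentile(np.asarray(correlation_history), q))

# Illustrative usage with synthetic checkpoints: at each evaluation, compare
# learned values against ground-truth values with Spearman's rho.
rng = np.random.default_rng(0)
true_values = rng.normal(size=200)
correlation_history = []
for step in range(50):
    noise = 1.0 / (step + 1)  # pretend the value estimates improve over training
    learned_values = true_values + noise * rng.normal(size=200)
    rho, _ = spearmanr(learned_values, true_values)
    correlation_history.append(rho)

print(percentile_score(correlation_history))
```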
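
The Experiment Setup row implies a standard one-step Q-learning target over a small set of discrete latent actions, with the sparse task reward scaled by a factor of 5. The sketch below is a minimal NumPy rendering of that target under stated assumptions: the constants (`NUM_LATENT_ACTIONS`, `REWARD_SCALE`, `GAMMA`) and function names are illustrative, and the paper's actual implementation builds on the ACME codebase rather than this code.

```python
import numpy as np

NUM_LATENT_ACTIONS = 8   # top-8 dominant latent actions (top 3 for Freeway)
REWARD_SCALE = 5.0       # sparse task reward scaled so it dominates learned behavior
GAMMA = 0.99             # assumed discount factor; the paper's value may differ

def q_learning_targets(q_next, rewards, dones):
    """One-step Q-learning targets over the discrete latent actions.

    q_next:  (batch, NUM_LATENT_ACTIONS) Q-values at the next state
    rewards: (batch,) sparse task rewards
    dones:   (batch,) episode-termination flags in {0, 1}
    """
    scaled_rewards = REWARD_SCALE * rewards
    bootstrap = GAMMA * (1.0 - dones) * q_next.max(axis=1)
    return scaled_rewards + bootstrap

# Illustrative usage with random numbers standing in for network outputs.
rng = np.random.default_rng(0)
q_next = rng.normal(size=(4, NUM_LATENT_ACTIONS))
rewards = np.array([0.0, 1.0, 0.0, 1.0])
dones = np.array([0.0, 1.0, 0.0, 0.0])
print(q_learning_targets(q_next, rewards, dones))
```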