Learning Value Functions from Undirected State-only Experience

Authors: Matthew Chang, Arjun Gupta, Saurabh Gupta

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods."
Researcher Affiliation | Academia | Matthew Chang, Arjun Gupta, Saurabh Gupta, University of Illinois at Urbana-Champaign, {mc48, arjung2, saurabhg}@illinois.edu
Pseudocode | Yes | Section A.8 (Algorithm) contains "Algorithm 1 LAQ".
Open Source Code | No | The paper provides a project website, https://matthewchang.github.io/latent_action_qlearning_site/, but this is a project overview page rather than a direct link to a source-code repository.
Open Datasets | Yes | "We use the Maze2D data from D4RL (Fu et al., 2020)." (A hedged loading sketch follows the table.)
Dataset Splits | No | The paper mentions training and evaluating models but does not provide specific percentages or counts for training, validation, or test splits. It implies some form of validation for model selection in Section A.11: "The numbers reported in Table 1 are the 95th percentile Spearman's correlation coefficients over the course of training. If training is stable and converges, this corresponds to taking the final value, and in the case that training is not stable and diverges, this acts as a form of early stopping." (A sketch of this selection rule follows the table.)
Hardware Specification | No | The paper states "We use the ACME codebase (Hoffman et al., 2020) for experiments." but does not give any specific hardware details such as GPU models, CPU types, or cloud-computing specifications.
Software Dependencies | No | The paper mentions using "the ACME codebase (Hoffman et al., 2020)" but does not provide version numbers for ACME or any other software dependencies.
Experiment Setup | Yes | "In all settings we do Q-learning over the top 8 dominant actions, except for Freeway, where using the top three actions stabilized training. We use a multi-layer perceptron for fθ and L2 loss for l. We scale up the sparse task rewards by a factor of 5 so that behavior is dominated by the task reward once policy starts solving the task. Results are averaged over 5 seeds and show standard error." (A hedged sketch of this setup follows the table.)
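
The Maze2D data cited in the Open Datasets row is available through the standard D4RL interface. The snippet below is a minimal loading sketch, assuming the `d4rl` package and the `maze2d-umaze-v1` variant (the specific Maze2D variant is an assumption, not taken from the excerpt above); since LAQ learns from state-only experience, the ground-truth actions in the dataset would be discarded.

```python
import gym
import d4rl  # registers the D4RL environments (Fu et al., 2020) with gym

# Assumption: the exact Maze2D variant is not specified in the excerpt above.
env = gym.make("maze2d-umaze-v1")
dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...

# LAQ is a state-only method, so only observations (and task rewards) would be kept.
states = dataset["observations"]
rewards = dataset["rewards"]
print(states.shape, rewards.shape)
```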
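The Dataset Splits row quotes a model-selection rule: report the 95th percentile of Spearman's correlation between learned and ground-truth values over the course of training. Below is a minimal sketch of that rule, assuming one array of value estimates per checkpoint; `learned_values_per_checkpoint` and `gt_values` are hypothetical names, not identifiers from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def report_score(learned_values_per_checkpoint, gt_values):
    """95th-percentile Spearman correlation over training (per the rule quoted from Section A.11).

    If training converges, this is close to the final correlation; if training
    later diverges, taking a high percentile acts as a form of early stopping.
    """
    corrs = [
        spearmanr(values, gt_values).correlation  # rank correlation at one checkpoint
        for values in learned_values_per_checkpoint
    ]
    return np.percentile(corrs, 95)
```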
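The Experiment Setup row mentions running Q-learning over the top 8 dominant latent actions (top 3 for Freeway) and scaling sparse task rewards by a factor of 5. The sketch below shows one plausible reading of those two preprocessing choices, restricting the discrete latent-action set to its most frequent labels; the array names are hypothetical placeholders, and the LAQ training loop itself (Algorithm 1) is not reproduced here.

```python
import numpy as np

REWARD_SCALE = 5.0   # sparse task rewards scaled up by a factor of 5
TOP_K_ACTIONS = 8    # top 8 dominant latent actions (3 for Freeway, per the paper)

def dominant_action_mask(latent_actions, top_k=TOP_K_ACTIONS):
    """Boolean mask keeping transitions whose inferred latent action is among the top_k most frequent."""
    labels, counts = np.unique(latent_actions, return_counts=True)
    dominant = set(labels[np.argsort(-counts)[:top_k]])
    return np.array([a in dominant for a in latent_actions])

# Hypothetical usage on arrays of inferred latent actions and sparse task rewards:
# mask = dominant_action_mask(latent_actions)
# scaled_rewards = REWARD_SCALE * rewards[mask]
```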