Learning Value Functions from Undirected State-only Experience

Authors: Matthew Chang, Arjun Gupta, Saurabh Gupta

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods."
Researcher Affiliation | Academia | Matthew Chang, Arjun Gupta, Saurabh Gupta, University of Illinois at Urbana-Champaign, {mc48, arjung2, saurabhg}@illinois.edu
Pseudocode | Yes | Section A.8 (Algorithm) contains "Algorithm 1 LAQ".
Open Source Code | No | The paper provides a project website, https://matthewchang.github.io/latent_action_qlearning_site/, but this is a project overview page rather than a direct link to a source-code repository.
Open Datasets | Yes | "We use the Maze2D data from D4RL (Fu et al., 2020)." (A hedged loading sketch follows the table.)
Dataset Splits | No | The paper mentions training and evaluating models but does not provide specific percentages or counts for training, validation, or test splits. It implies some form of validation for model selection in Section A.11: "The numbers reported in Table 1 are the 95th percentile Spearman's correlation coefficients over the course of training. If training is stable and converges, this corresponds to taking the final value, and in the case that training is not stable and diverges, this acts as a form of early stopping." (A sketch of this selection rule follows the table.)
Hardware Specification | No | The paper states "We use the ACME codebase (Hoffman et al., 2020) for experiments." but does not give any specific hardware details such as GPU models, CPU types, or cloud-computing specifications.
Software Dependencies | No | The paper mentions using "the ACME codebase (Hoffman et al., 2020)" but does not provide version numbers for ACME or any other software dependencies.
Experiment Setup | Yes | "In all settings we do Q-learning over the top 8 dominant actions, except for Freeway, where using the top three actions stabilized training. We use a multi-layer perceptron for fθ and L2 loss for l. We scale up the sparse task rewards by a factor of 5 so that behavior is dominated by the task reward once policy starts solving the task. Results are averaged over 5 seeds and show standard error." (A hedged sketch of this setup follows the table.)
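
The Maze2D data cited in the Open Datasets row is available through the standard D4RL interface. The snippet below is a minimal loading sketch, assuming the `d4rl` package and the `maze2d-umaze-v1` variant (the specific Maze2D variant is an assumption, not taken from the excerpt above); since LAQ learns from state-only experience, the ground-truth actions in the dataset would be discarded.

```python
import gym
import d4rl  # registers the D4RL environments (Fu et al., 2020) with gym

# Assumption: the exact Maze2D variant is not specified in the excerpt above.
env = gym.make("maze2d-umaze-v1")
dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...

# LAQ is a state-only method, so only observations (and task rewards) would be kept.
states = dataset["observations"]
rewards = dataset["rewards"]
print(states.shape, rewards.shape)
```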
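The Dataset Splits row quotes a model-selection rule: report the 95th percentile of Spearman's correlation between learned and ground-truth values over the course of training. Below is a minimal sketch of that rule, assuming one array of value estimates per checkpoint; `learned_values_per_checkpoint` and `gt_values` are hypothetical names, not identifiers from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def report_score(learned_values_per_checkpoint, gt_values):
    """95th-percentile Spearman correlation over training (per the rule quoted from Section A.11).

    If training converges, this is close to the final correlation; if training
    later diverges, taking a high percentile acts as a form of early stopping.
    """
    corrs = [
        spearmanr(values, gt_values).correlation  # rank correlation at one checkpoint
        for values in learned_values_per_checkpoint
    ]
    return np.percentile(corrs, 95)
```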
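The Experiment Setup row mentions running Q-learning over the top 8 dominant latent actions (top 3 for Freeway) and scaling sparse task rewards by a factor of 5. The sketch below shows one plausible reading of those two preprocessing choices, restricting the discrete latent-action set to its most frequent labels; the array names are hypothetical placeholders, and the LAQ training loop itself (Algorithm 1) is not reproduced here.

```python
import numpy as np

REWARD_SCALE = 5.0   # sparse task rewards scaled up by a factor of 5
TOP_K_ACTIONS = 8    # top 8 dominant latent actions (3 for Freeway, per the paper)

def dominant_action_mask(latent_actions, top_k=TOP_K_ACTIONS):
    """Boolean mask keeping transitions whose inferred latent action is among the top_k most frequent."""
    labels, counts = np.unique(latent_actions, return_counts=True)
    dominant = set(labels[np.argsort(-counts)[:top_k]])
    return np.array([a in dominant for a in latent_actions])

# Hypothetical usage on arrays of inferred latent actions and sparse task rewards:
# mask = dominant_action_mask(latent_actions)
# scaled_rewards = REWARD_SCALE * rewards[mask]
```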