Learning State Representations from Random Deep Action-conditional Predictions

Authors: Zeyu Zheng, Vivek Veeriah, Risto Vuorio, Richard L Lewis, Satinder Singh

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our main contribution in this work is an empirical finding that random General Value Functions (GVFs), i.e., deep action-conditional predictions that are random both in what feature of observations they predict and in the sequence of actions the predictions are conditioned upon, form good auxiliary tasks for reinforcement learning (RL) problems. In particular, we show that random deep action-conditional predictions, when used as auxiliary tasks, yield state representations that produce control performance competitive with state-of-the-art hand-crafted auxiliary tasks like value prediction, pixel control, and CURL in both Atari and DeepMind Lab tasks. In this section, we present the empirical results of comparing the performance of random GVFs against the A2C baseline [20] and three other auxiliary tasks, i.e., multi-horizon value prediction (MHVP) [8], pixel control (PC) [13], and CURL [15]. We conducted the evaluation in 49 Atari games [4] and 12 DeepMind Lab environments [2].
Researcher Affiliation | Academia | Zeyu Zheng (University of Michigan, zeyu@umich.edu); Vivek Veeriah (University of Michigan, vveeriah@umich.edu); Risto Vuorio (University of Oxford, risto.vuorio@cs.ox.ac.uk); Richard Lewis (University of Michigan, rickl@umich.edu); Satinder Singh (University of Michigan, baveja@umich.edu)
Pseudocode | Yes | The Appendix includes pseudocode for the random generator algorithm. (An illustrative sketch of such a generator appears after this table.)
Open Source Code | Yes | We open-sourced our code at https://github.com/Hwhitetooth/random_gvfs.
Open Datasets | Yes | We conducted the evaluation in 49 Atari games [4] and 12 DeepMind Lab environments [2].
Dataset Splits | No | We searched c in {0.1, 0.2, 0.5, 1, 2} on the 6 games in the previous section. (This describes a hyperparameter search across a subset of games, not a formal validation split of a dataset with explicit percentages or counts.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory in the main text. The checklist indicates this information is in the Appendix, but the question asks about the main text.
Software Dependencies | No | The paper mentions methods like A2C but does not provide specific version numbers for software dependencies in the main text.
Experiment Setup | Yes | Hyperparameters. The discount factor, depth, and repeat were set to 0.95, 8, and 16 respectively. Thus there are 16 + 8 × 16 × |A| total predictions. Random GVFs without action-conditioning have the same question network except that no prediction was conditioned on actions. To match the total number of predictions, we used 16 + 8 × 16 × |A| random features for the discounted-sum predictions and 8 × 16 features for the shallow action-conditional predictions. Additional random features were generated by applying more random linear functions to the image patches. The discount factor for the discounted-sum predictions is also 0.95. More implementation details are provided in the Appendix. We searched c in {0.1, 0.2, 0.5, 1, 2} on the 6 games in the previous section; c = 1 worked best for all methods. Neural Network Architecture. We used A2C [20] with a standard neural network architecture for Atari [21] as our base agent. Specifically, the state representation module consists of 3 convolutional layers. The RL module has one hidden dense layer and two output heads for the policy and the value function respectively. The answer network has one hidden dense layer with 512 units followed by the output layer. (A hedged sketch of this architecture appears after this table.)
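The Pseudocode row notes that the Appendix contains the actual generator pseudocode; the Python sketch below is only an illustration of how random deep action-conditional questions might be enumerated so that the prediction count matches the 16 + 8 × 16 × |A| figure quoted in the experiment-setup row. The patch size, weight scale, and the reading that level 0 holds unconditional discounted-sum predictions are assumptions, not details confirmed by the paper.

```python
import numpy as np


def make_random_patch_features(num_features, patch_size=7, rng=None):
    """Random linear functions applied to flattened image patches.

    The patch size and weight scale are assumptions for illustration; the
    paper only states that random features are linear functions of image
    patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(size=(num_features, patch_size * patch_size))


def enumerate_question_network(num_actions, depth=8, repeat=16):
    """Enumerate one plausible deep action-conditional question set.

    Level 0 holds `repeat` discounted-sum predictions of the random
    features; each deeper level adds `repeat * num_actions` predictions,
    one per (feature, action) pair, conditioned on the level below.
    Total: repeat + depth * repeat * num_actions.
    """
    questions = [("discounted_sum", f, None, 0) for f in range(repeat)]
    for level in range(1, depth + 1):
        for f in range(repeat):
            for a in range(num_actions):
                questions.append(("action_conditional", f, a, level))
    return questions


if __name__ == "__main__":
    num_actions = 18  # full Atari action set
    weights = make_random_patch_features(num_features=16)
    questions = enumerate_question_network(num_actions)
    assert len(questions) == 16 + 8 * 16 * num_actions
    print(f"{weights.shape[0]} random features, {len(questions)} predictions")
```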
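The architecture description in the experiment-setup row (3-layer convolutional encoder, an RL module with one hidden dense layer plus policy and value heads, and an answer network with one 512-unit hidden layer plus an output layer) can be sketched in PyTorch as below. The convolution kernel sizes and strides follow the standard Atari network of reference [21] rather than the quoted text, and the 4-frame 84x84 input stack and the 512-unit RL hidden layer are assumptions; the official implementation at the linked repository is the authoritative source.

```python
import torch
import torch.nn as nn


class RGVFNet(nn.Module):
    """Sketch of the described architecture: a shared 3-layer conv encoder,
    an A2C head (hidden dense layer + policy/value outputs), and an answer
    network (one 512-unit hidden layer + one output per prediction)."""

    def __init__(self, num_actions: int, num_predictions: int):
        super().__init__()
        # State representation module: 3 convolutional layers.
        # Kernel sizes/strides are the standard Atari choices (assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 64 * 7 * 7  # for 84x84 inputs
        # RL module: one hidden dense layer, then policy and value heads.
        self.rl_hidden = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.policy = nn.Linear(512, num_actions)
        self.value = nn.Linear(512, 1)
        # Answer network: one 512-unit hidden layer, then the GVF outputs.
        self.answer = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_predictions),
        )

    def forward(self, obs: torch.Tensor):
        z = self.encoder(obs)
        h = self.rl_hidden(z)
        return self.policy(h), self.value(h), self.answer(z)


if __name__ == "__main__":
    num_actions = 18
    net = RGVFNet(num_actions, num_predictions=16 + 8 * 16 * num_actions)
    logits, value, answers = net(torch.zeros(1, 4, 84, 84))
    print(logits.shape, value.shape, answers.shape)
```

The answer head is fed the encoder output rather than the RL hidden layer, matching the reading that the auxiliary predictions shape the shared state representation rather than the policy head.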