The Utility of Sparse Representations for Control in Reinforcement Learning

Authors: Vincent Liu, Raksha Kumaraswamy, Lei Le, Martha White (pp. 4384-4391)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate control performance on four benchmark domains: Mountain Car, Puddle World, Acrobot and Catcher. All domains are episodic, with discount set to 1 until termination. We choose these domains because they are well-understood, and typically considered relatively simple. A priori, it would be expected that a standard action-value method, like Sarsa, with a two-layer neural network, should be capable of learning a near-optimal policy in all four of these domains. We provide details about the domains in the Appendix. The experimental set-up is as follows. To extract a representation with a neural network, to be used for control, we pre-train the neural network on a batch of data with a mean-squared temporal difference error (MSTDE) objective and the applicable regularization strategies. The training data consists of trajectories generated by a fixed policy that explores much of the space in the various domains. For the SR-NN, we use our distributional regularization strategy, described in a later section. This learned representation is then fixed, and used by a (fully incremental) Sarsa(0) agent for learning a control policy, where only the weights w on the last layer are updated. The meta-parameters for the batch-trained neural network producing the representation and the Sarsa agent were swept in a wide range, and chosen based on control performance. The aim is to provide the best opportunity for a regular feed-forward network (NN) to learn on these problems, as it is more sensitive to its meta-parameters than the SR-NN. Additional details on ranges and objectives are provided in the Appendix. We choose this two-stage training regime to remove confounding factors in difficulties of training neural networks incrementally. Our goal here is to identify if a sparse representation can improve control performance, and if so, why. The networks are trained with an objective for learning values, on a large batch of data generated by a policy that covers the space; the learned representations are capable of representing the optimal policy. We investigate their utility for fully incremental learning. Outside of this carefully controlled experiment, we advocate for learning the representation incrementally, for the task faced by the agent. [Figure 2 caption: Learning curves for Sarsa(0) comparing SR-NN, Tile Coding and vanilla NN in the four domains.] [Code sketches of the two training stages follow the table.]
Researcher Affiliation | Academia | Vincent Liu (1), Raksha Kumaraswamy (1), Lei Le (2), Martha White (1). (1) Department of Computing Science, University of Alberta, Edmonton, Canada, {vliu1, kumarasw, whitem}@ualberta.ca. (2) Department of Computer Science, Indiana University Bloomington, Indiana, USA, leile@indiana.edu
Pseudocode | Yes | We include pseudocode for optimizing the regularized objective with the SKL, in Algorithm 1 in the Appendix.
Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We evaluate control performance on four benchmark domains: Mountain Car, Puddle World, Acrobot and Catcher. All domains are episodic, with discount set to 1 until termination. We choose these domains because they are well-understood, and typically considered relatively simple.
Dataset Splits | No | The paper states 'To extract a representation with a neural network, to be used for control, we pre-train the neural network on a batch of data...' but does not specify any training, validation, or test splits with percentages, sample counts, or a specific splitting methodology.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | The experimental set-up is as follows. To extract a representation with a neural network, to be used for control, we pre-train the neural network on a batch of data with a mean-squared temporal difference error (MSTDE) objective and the applicable regularization strategies. The training data consists of trajectories generated by a fixed policy that explores much of the space in the various domains. For the SR-NN, we use our distributional regularization strategy, described in a later section. This learned representation is then fixed, and used by a (fully incremental) Sarsa(0) agent for learning a control policy, where only the weights w on the last layer are updated. The meta-parameters for the batch-trained neural network producing the representation and the Sarsa agent were swept in a wide range, and chosen based on control performance. The aim is to provide the best opportunity for a regular feed-forward network (NN) to learn on these problems, as it is more sensitive to its meta-parameters than the SR-NN. Additional details on ranges and objectives are provided in the Appendix. We choose this two-stage training regime to remove confounding factors in difficulties of training neural networks incrementally. Our goal here is to identify if a sparse representation can improve control performance, and if so, why. The networks are trained with an objective for learning values, on a large batch of data generated by a policy that covers the space; the learned representations are capable of representing the optimal policy. We investigate their utility for fully incremental learning. Outside of this carefully controlled experiment, we advocate for learning the representation incrementally, for the task faced by the agent. Both SR-NN and NN used two layers, of size [32, 256], with ReLU activations. ... The bootstrap estimates, that correspond to the algorithm settings for the learning curves, are plotted in Figure 4(c). We can see that the relative ordering of the value estimates is maintained with SR-NN and Dropout-NN, which were the two NNs effective for on-policy control, and that their values converge to near the true values (given in Figure 4(d)). The other representations, on the other hand, have very poor estimates. Moreover, these estimates seem to decrease together, suggesting interference is causing overgeneralization to reduce values in other states. Finally, we report additional measures of locality, to determine if the successful methods are indeed sparse. The heatmaps provide some evidence of locality, but are more qualitative than quantitative. We report two quantitative measures: instance sparsity and activation overlap. Instance sparsity corresponds to the percentage of active units for each input. A sparse representation should be instance sparse, where most inputs produce relatively low percentage activation. As shown in Figure 5, SR-NN has consistently low instance sparsity across all four domains, with slightly higher level in Catcher, potentially explaining the noisy behaviour in that domain. Once again, Dropout-NN is noticeably more instance ... [A sketch of the instance-sparsity measure also follows the table.]
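The first training stage described above pre-trains the network on a batch of transitions with a mean-squared temporal difference error (MSTDE) objective. Below is a minimal PyTorch sketch of that objective, assuming the [32, 256] ReLU architecture quoted above, a state-value head, and no distributional (SKL) regularizer, since the regularizer's form is not given in this excerpt; all class and function names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Two hidden ReLU layers of sizes [32, 256]; the 256-unit output is the
    representation later handed to the Sarsa(0) agent."""
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 32), nn.ReLU(),
            nn.Linear(32, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, 1)

    def forward(self, s):
        phi = self.body(s)                      # the (to-be-frozen) representation
        return self.head(phi).squeeze(-1), phi

def mstde_loss(net, s, r, s_next, done, gamma=1.0):
    """Mean-squared TD error over a batch of transitions (s, r, s').

    Written in residual-gradient form (gradients flow through both v(s) and
    v(s')); a semi-gradient variant would detach the bootstrap target. The
    excerpt does not specify which form the paper uses.
    """
    v, _ = net(s)
    v_next, _ = net(s_next)
    target = r + gamma * (1.0 - done) * v_next  # discount is 1 until termination
    return ((target - v) ** 2).mean()
```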
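In the second stage, the representation is fixed and a fully incremental Sarsa(0) agent learns only the last-layer weights w. The sketch below shows that control loop; the `features` function (the frozen network body), the environment's step interface, and the epsilon-greedy exploration scheme are assumptions made for illustration rather than details from the paper.

```python
import numpy as np

def sarsa0_fixed_representation(env, features, num_actions, num_episodes,
                                alpha=0.1, epsilon=0.1, gamma=1.0):
    """Incremental Sarsa(0) over a fixed feature map: only w is updated."""
    dim = len(features(env.reset()))
    w = np.zeros((num_actions, dim))            # one weight vector per action

    def q(phi):                                 # action values for one state
        return w @ phi

    def epsilon_greedy(phi):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(q(phi)))

    returns = []
    for _ in range(num_episodes):
        s = env.reset()
        phi = features(s)
        a = epsilon_greedy(phi)
        done, total = False, 0.0
        while not done:
            s_next, r, done = env.step(a)       # assumed (state, reward, done) interface
            total += r
            phi_next = features(s_next)
            a_next = epsilon_greedy(phi_next)
            # TD(0) target; discount of 1 until termination, as in the paper
            target = r if done else r + gamma * q(phi_next)[a_next]
            w[a] += alpha * (target - q(phi)[a]) * phi
            phi, a = phi_next, a_next
        returns.append(total)
    return w, returns
```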
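The Experiment Setup row also cites instance sparsity, the percentage of active units for each input, as one of the locality measures reported in Figure 5. A short sketch of that measure over a batch of hidden-layer activations; the activation threshold and array layout are assumptions.

```python
import numpy as np

def instance_sparsity(activations, threshold=0.0):
    """Percentage of active hidden units per input.

    activations: array of shape (num_inputs, num_hidden), e.g. the ReLU
    outputs of the representation layer over a batch of evaluation states.
    A unit counts as active when its activation exceeds `threshold`
    (0 is the natural choice for ReLU units).
    """
    active = activations > threshold
    return 100.0 * active.mean(axis=1)          # one percentage per input
```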