Learning State Representations via Retracing in Reinforcement Learning

Authors: Changmin Yu, Dong Li, Jianye Hao, Jun Wang, Neil Burgess

ICLR 2022

Reproducibility assessment: each variable below lists the extracted result and the supporting LLM response.

Research Type: Experimental
Evidence: "Through extensive empirical studies on visual-based continuous control benchmarks, we demonstrate that CCWM achieves state-of-the-art performance in terms of sample efficiency and asymptotic performance, whilst exhibiting behaviours that are indicative of stronger representation learning."

Researcher Affiliation: Collaboration
Evidence: "Changmin Yu (1), Dong Li (2), Jianye Hao (3, 2), Jun Wang (1, 2), Neil Burgess (1); (1) UCL, London, United Kingdom; (2) Huawei Noah's Ark Lab; (3) College of Intelligence and Computing, Tianjin University"

Pseudocode: Yes
Evidence: "The pseudocode for CCWM training is shown in Algorithm 1."

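Algorithm 1 itself is not reproduced in this report. As an illustration only, below is a minimal sketch of what one CCWM-style update could look like, assuming a latent world model trained with a standard model loss plus a λ-weighted retracing auxiliary loss (Eq. 8); world_model, model_loss, retrace_loss, and replay_buffer are hypothetical names, not the authors' API.

```python
import tensorflow as tf

# Hypothetical sketch of one training step; NOT the authors' Algorithm 1.
def train_step(world_model, optimizer, replay_buffer,
               batch_size=64, seq_len=50, lam=1.0):
    # Sample a batch of 50-step trajectories, as described in the paper.
    batch = replay_buffer.sample(batch_size, seq_len)

    with tf.GradientTape() as tape:
        model_loss = world_model.model_loss(batch)      # forward world-model objective
        retrace_loss = world_model.retrace_loss(batch)  # retracing auxiliary loss
        loss = model_loss + lam * retrace_loss          # lam = λ = 1.0 by default

    grads = tape.gradient(loss, world_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, world_model.trainable_variables))
    return loss
```
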
Open Source Code: Yes
Evidence: "The Python implementation of CCWM can be found at https://github.com/changmin-yu/CCWM_code."

Open Datasets: Yes
Evidence: "We base our experimental studies on the challenging visual-based continuous control benchmarks, for which we choose 8 tasks from the DeepMind Control Suite (Tassa et al., 2018; Figure 3a)."

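For reference, DeepMind Control Suite tasks are loaded through the open-source dm_control package. A minimal sketch follows; the cheetah-run choice is illustrative and not necessarily one of the paper's 8 tasks.

```python
from dm_control import suite

# Load a continuous-control task (domain/task choice is illustrative).
env = suite.load(domain_name="cheetah", task_name="run")

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    # Spec-conforming placeholder action; a trained policy would go here.
    action = env.action_spec().generate_value()
    time_step = env.step(action)
    episode_return += time_step.reward
print(episode_return)
```
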
Dataset Splits: No
The paper states that 'Greedy evaluation is performed every 10^4 training steps' and that 'The reported evaluation scores are averaged values over 5 random seeds', but it does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts per split).

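The stated evaluation schedule could be organised as in the sketch below, which assumes hypothetical agent and dm_env-style environment interfaces; the number of episodes per evaluation is an assumption, as the paper excerpt does not state it.

```python
import numpy as np

EVAL_EVERY = 10_000  # greedy evaluation every 10^4 training steps
NUM_SEEDS = 5        # reported scores are averages over 5 random seeds

def greedy_eval(agent, env, num_episodes=10):  # num_episodes is assumed
    """Mean episode return of the deterministic (greedy) policy."""
    returns = []
    for _ in range(num_episodes):
        time_step, total = env.reset(), 0.0
        while not time_step.last():
            action = agent.act(time_step.observation, greedy=True)
            time_step = env.step(action)
            total += time_step.reward
        returns.append(total)
    return float(np.mean(returns))

# Reported curves: at each evaluation point, average the per-seed scores, e.g.
# score = np.mean([greedy_eval(agents[s], envs[s]) for s in range(NUM_SEEDS)])
```
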
Hardware Specification: No
The paper notes that the implementation uses TensorFlow but does not provide hardware details such as GPU models, CPU types, or cloud computing specifications used to run the experiments.

Software Dependencies: No
The paper mentions the use of TensorFlow and TensorFlow Distributions but does not specify their version numbers, nor does it list other software dependencies with versions.

Experiment Setup: Yes
Evidence: "For the actual training, the batch size is chosen to be 64, and all sampled trajectories are taken to be 50 timesteps long... The parameter λ controlling the weights of the retrace auxiliary loss in Eq. 8 is set to 1.0. The discounting factor for the expected value function is set to 0.99. The default values for the parameters we used for the empirical evaluation shown in Figure 5 are: η = 0.10, S = 10, τ = 5, ξ = 1 × 10^-5."

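Collected in one place, the stated defaults could be written as a configuration dictionary. Key names are illustrative, and ξ is read as 1 × 10^-5 from the garbled source; only the values come from the paper.

```python
# Default CCWM hyperparameters as reported in the paper (key names are ours).
CCWM_DEFAULTS = {
    "batch_size": 64,        # trajectories per training batch
    "sequence_length": 50,   # timesteps per sampled trajectory
    "retrace_weight": 1.0,   # λ, weight of the retrace auxiliary loss (Eq. 8)
    "discount": 0.99,        # discount factor for the expected value function
    "eta": 0.10,             # η
    "S": 10,
    "tau": 5,                # τ
    "xi": 1e-5,              # ξ, assuming the source's "1 105" means 1 × 10^-5
}
```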