Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning

Authors: Yang Yue, Bingyi Kang, Zhongwen Xu, Gao Huang, Shuicheng Yan

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on Atari 100K and Deep Mind Control Suite benchmarks to validate their effectiveness in improving sample efficiency. It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.
Researcher Affiliation | Collaboration | Yang Yue (1,2)*, Bingyi Kang (2), Zhongwen Xu (2), Gao Huang (1), Shuicheng Yan (2). 1: Department of Automation, BNRist, Tsinghua University; 2: Sea AI Lab. Emails: yueyang22f@gmail.com, {kangby, xuzw, yansc}@sea.com, gaohuang@tsinghua.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | Code will be open-sourced upon acceptance.
Open Datasets | Yes | We conduct experiments on Atari 100K (Bellemare et al. 2013; Kaiser et al. 2020) and Deep Mind Control Suite (Tassa et al. 2018) benchmarks to validate their effectiveness in improving sample efficiency.
Dataset Splits | No | No explicit training, validation, or test splits (e.g., percentages or exact sample counts per split) are specified. The paper refers to '100K' and '500K environment steps', which are interaction limits, and to '10 seeds for each game' for evaluation, but not to any dataset partitioning.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU models, or memory specifications) used for running experiments were mentioned in the paper.
Software Dependencies | No | The paper mentions the 'Adam Optimizer (Kingma and Ba 2015)' and refers to using 'the official code of SPR' and 'modified SPR for continuous control' as codebases, but does not provide specific version numbers for software dependencies such as programming languages, frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | For discrete action tasks, the prediction step is set to K = 5. The Q-learning loss and the Value-Consistent loss are optimized jointly by a single Adam optimizer (Kingma and Ba 2015) with a batch size of 32. For continuous control tasks, the prediction step is set to K = 3. The actor loss, critic loss, and Value-Consistent loss are optimized separately by three Adam optimizers, with a batch size of 512 for the actor-critic update and 128 for the VCR update. (See the configuration sketch below.)
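
To make the setup in the last row concrete, below is a minimal PyTorch sketch of how the optimizers and batch sizes might be wired up. This is an illustration under stated assumptions, not the authors' implementation: the module names (encoder, q_head, vc_head, actor, critic), the action dimensions, and the learning rates are hypothetical placeholders; only the prediction steps (K = 5 / K = 3), the use of Adam, the joint versus separate optimization, and the batch sizes (32; 512 and 128) come from the paper's description.

    from torch import nn, optim

    # ----- Discrete control (Atari 100K), as described in the row above -----
    # Prediction step K = 5; Q-learning loss and Value-Consistent (VC) loss are
    # optimized jointly by a single Adam optimizer with batch size 32.
    K_DISCRETE = 5
    BATCH_DISCRETE = 32

    # Hypothetical placeholder networks (the paper builds on the SPR codebase).
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, 256), nn.ReLU())
    q_head = nn.Linear(256, 18)    # 18 actions: full Atari action set (assumption)
    vc_head = nn.Linear(256, 18)   # value-consistency prediction head (assumption)

    joint_opt = optim.Adam(
        list(encoder.parameters()) + list(q_head.parameters()) + list(vc_head.parameters()),
        lr=1e-4,                   # learning rate is not given in the quoted text
    )

    def discrete_update(obs, q_loss_fn, vc_loss_fn):
        """One joint step: total loss = Q-learning loss + Value-Consistent loss."""
        z = encoder(obs)           # obs: (BATCH_DISCRETE, 1, 84, 84)
        loss = q_loss_fn(q_head(z)) + vc_loss_fn(vc_head(z))
        joint_opt.zero_grad()
        loss.backward()
        joint_opt.step()

    # ----- Continuous control (DeepMind Control Suite) -----
    # Prediction step K = 3; actor, critic, and VC losses use three separate Adam
    # optimizers; batch size 512 for the actor-critic update, 128 for the VCR update.
    K_CONTINUOUS = 3
    BATCH_ACTOR_CRITIC = 512
    BATCH_VCR = 128

    actor = nn.Linear(256, 6)      # 6-dim action space (assumption)
    critic = nn.Linear(256 + 6, 1)
    vc_head_cont = nn.Linear(256, 1)

    actor_opt = optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = optim.Adam(critic.parameters(), lr=1e-4)
    vc_opt = optim.Adam(vc_head_cont.parameters(), lr=1e-4)

The K constants appear here only as configuration values; in the paper they control how many future steps the value-consistency objective predicts ahead, which this sketch does not implement.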