Decoupling Representation Learning from Reinforcement Learning

Authors: Adam Stooke, Kimin Lee, Pieter Abbeel, Michael Laskin

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL in most environments. Additionally, we benchmark several leading UL algorithms by pre-training encoders on expert demonstrations and using them, with weights frozen, in RL agents; we find that agents using ATC-trained encoders outperform all others. We also train multitask encoders on data from multiple environments and show generalization to different downstream RL tasks. Finally, we ablate components of ATC... (See the contrastive-objective sketch below the table.)
Researcher Affiliation | Academia | University of California, Berkeley.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our experiments span visually diverse RL benchmarks in DeepMind Control, DeepMind Lab, and Atari, and our complete code is available at https://github.com/astooke/rlpyt/tree/master/rlpyt/ul.
Open Datasets | Yes | We evaluate ATC on three standard, visually diverse RL benchmarks: the DeepMind control suite (DMControl; Tassa et al. 2018), Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and DeepMind Lab (DMLab; Beattie et al. 2016). For convenience, we drew expert demonstrations from partially-trained RL agents, and every UL algorithm trained on the same data set for each environment.
Dataset Splits | No | The paper describes training and evaluation procedures but does not explicitly mention a dedicated validation dataset split or its size/percentage for hyperparameter tuning.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU models, memory details).
Software Dependencies | No | The paper mentions using PPO, RAD-SAC, and specific augmentations but does not provide specific version numbers for any software dependencies or libraries. (See the augmentation sketch below the table.)
Experiment Setup | Yes | A difference from prior work is that we use more downsampling in our convolutional network, by using strides (2, 2, 2, 1) instead of (2, 1, 1, 1) to reduce the convolution output image by 25x. For both Atari and DMLab, we use PPO (Schulman et al., 2017). In Atari, we use feed-forward agents, sticky actions, and no end-of-life boundaries for RL episodes. In DMLab we used recurrent, LSTM agents receiving only a single time-step image input, the four-layer convolution encoder from Jaderberg et al. (2019), and we tuned the entropy bonus for each level. Since the ATC batch size was 512 but the RL batch size was 1024, performing twice as many UL updates still only consumed the same amount of encoder training data as RL. (See the encoder sketch below the table.)
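
Since the table notes that the paper contains no pseudocode, the ATC objective referenced in the Research Type row is illustrated here only as a rough aid: a minimal sketch of a temporal contrastive (InfoNCE) loss between an augmented observation and a later observation encoded by a momentum target encoder, which is the general shape of ATC. The module sizes, the bilinear classifier W, the momentum coefficient, and the temporal offset handling are illustrative assumptions, not the authors' exact implementation (see the released rlpyt/ul code for that).

# Minimal sketch (PyTorch) of a temporal contrastive (InfoNCE) objective in the
# spirit of ATC: the anchor is an augmented observation o_t, the positive is the
# augmented observation k steps later, encoded by a momentum ("target") encoder.
# Shapes, the bilinear W, and the momentum coefficient are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalContrast(nn.Module):
    def __init__(self, encoder, latent_dim, momentum=0.01):
        super().__init__()
        self.encoder = encoder                         # online encoder, trained by UL
        self.target_encoder = copy.deepcopy(encoder)   # momentum copy, no gradients
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.W = nn.Parameter(torch.randn(latent_dim, latent_dim) * 0.01)  # bilinear classifier
        self.momentum = momentum

    def loss(self, obs_t, obs_tk):
        """obs_t, obs_tk: independently augmented batches at times t and t+k."""
        anchor = self.encoder(obs_t)                   # (B, latent_dim)
        with torch.no_grad():
            positive = self.target_encoder(obs_tk)     # (B, latent_dim)
        logits = anchor @ self.W @ positive.T          # (B, B) pairwise scores
        labels = torch.arange(logits.shape[0], device=logits.device)
        return F.cross_entropy(logits, labels)         # InfoNCE: diagonal entries are positives

    @torch.no_grad()
    def update_target(self):
        # Exponential moving average of the online weights into the target encoder.
        for p, p_t in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            p_t.mul_(1.0 - self.momentum).add_(self.momentum * p)

In the decoupled setting described in the paper, the RL agent would then use the online encoder with frozen weights, while only the policy and value heads receive RL gradients.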
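
The Software Dependencies row mentions "specific augmentations". The augmentation family used throughout this line of work is random shift (pad-and-random-crop); a minimal sketch follows, assuming a 4-pixel pad, which is an illustrative value rather than one taken from the paper.

# Minimal sketch of a random-shift (pad-and-random-crop) image augmentation of the
# kind referenced under Software Dependencies; the 4-pixel pad is an assumption.
import torch
import torch.nn.functional as F

def random_shift(images, pad=4):
    """images: (B, C, H, W) tensor; returns randomly shifted float copies."""
    b, c, h, w = images.shape
    padded = F.pad(images.float(), (pad, pad, pad, pad), mode="replicate")
    out = torch.empty((b, c, h, w), dtype=torch.float32)
    for i in range(b):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out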
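
The 25x figure in the Experiment Setup row can be sanity-checked with a small sketch, assuming a four-layer convolutional stack with 3x3 kernels, no padding, 32 channels per layer, and 84x84 image inputs (a common DMControl encoder configuration; the channel counts and input size are assumptions). With strides (2, 1, 1, 1) the output feature map is 35x35; with (2, 2, 2, 1) it is 7x7, i.e. 25x fewer output pixels.

# Sketch checking the "25x" downsampling claim in the Experiment Setup row,
# assuming 3x3 kernels, no padding, 32 channels, and 84x84 inputs.
import torch
import torch.nn as nn

def conv_encoder(strides):
    layers, in_ch = [], 9  # e.g. 3 stacked RGB frames; the channel count is an assumption
    for s in strides:
        layers += [nn.Conv2d(in_ch, 32, kernel_size=3, stride=s), nn.ReLU()]
        in_ch = 32
    return nn.Sequential(*layers)

x = torch.zeros(1, 9, 84, 84)
baseline = conv_encoder((2, 1, 1, 1))(x)      # -> (1, 32, 35, 35)
downsampled = conv_encoder((2, 2, 2, 1))(x)   # -> (1, 32, 7, 7)
print(baseline.shape, downsampled.shape)
print((35 * 35) / (7 * 7))                    # 25.0: the output image is reduced by 25x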