Decoupling Representation Learning from Reinforcement Learning
Authors: Adam Stooke, Kimin Lee, Pieter Abbeel, Michael Laskin
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL in most environments. Additionally, we benchmark several leading UL algorithms by pre-training encoders on expert demonstrations and using them, with weights frozen, in RL agents; we find that agents using ATC-trained encoders outperform all others. We also train multitask encoders on data from multiple environments and show generalization to different downstream RL tasks. Finally, we ablate components of ATC. (A hedged sketch of the frozen-encoder protocol appears below the table.) |
| Researcher Affiliation | Academia | University of California, Berkeley. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our experiments span visually diverse RL benchmarks in DeepMind Control, DeepMind Lab, and Atari, and our complete code is available at https://github.com/astooke/rlpyt/tree/master/rlpyt/ul. |
| Open Datasets | Yes | We evaluate ATC on three standard, visually diverse RL benchmarks: the DeepMind control suite (DMControl; Tassa et al. 2018), Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and DeepMind Lab (DMLab; Beattie et al. 2016). For convenience, we drew expert demonstrations from partially-trained RL agents, and every UL algorithm trained on the same data set for each environment. |
| Dataset Splits | No | The paper describes training and evaluation procedures but does not explicitly mention a dedicated validation dataset split or its size/percentage for hyperparameter tuning. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU models, memory details). |
| Software Dependencies | No | The paper mentions using PPO, RAD-SAC, and specific augmentations but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | A difference from prior work is that we use more downsampling in our convolutional network, by using strides (2, 2, 2, 1) instead of (2, 1, 1, 1) to reduce the convolution output image by 25x. For both Atari and DMLab, we use PPO (Schulman et al., 2017). In Atari, we use feed-forward agents, sticky actions, and no end-of-life boundaries for RL episodes. In DMLab we used recurrent, LSTM agents receiving only a single time-step image input, the four-layer convolution encoder from (Jaderberg et al., 2019), and we tuned the entropy bonus for each level. Since the ATC batch size was 512 but the RL batch size was 1024, performing twice as many UL updates still only consumed the same amount of encoder training data as RL. |
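The Experiment Setup row above specifies convolution strides (2, 2, 2, 1) in place of (2, 1, 1, 1). Below is a minimal PyTorch sketch of such a four-layer encoder; the 3x3 kernels, 32-channel width, and 84x84 input resolution are assumptions not stated in the excerpt. Under those assumptions the final feature map is 7x7 rather than the 35x35 produced by strides (2, 1, 1, 1), which is consistent with the quoted 25x reduction in convolution output size.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Four-layer convolutional encoder with strides (2, 2, 2, 1).

    Kernel size (3x3), channel width (32), and the 84x84 input are
    assumptions for illustration; the excerpt only specifies the strides.
    """

    def __init__(self, in_channels=3, width=32, strides=(2, 2, 2, 1)):
        super().__init__()
        layers, c_in = [], in_channels
        for stride in strides:
            layers += [nn.Conv2d(c_in, width, kernel_size=3, stride=stride), nn.ReLU()]
            c_in = width
        self.conv = nn.Sequential(*layers)

    def forward(self, obs):
        # obs: (batch, channels, height, width), e.g. (8, 3, 84, 84)
        return self.conv(obs).flatten(start_dim=1)

if __name__ == "__main__":
    feats = ConvEncoder()(torch.zeros(8, 3, 84, 84))
    print(feats.shape)  # torch.Size([8, 1568]) -> 7x7x32 feature map, flattened
```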
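The Research Type row quotes the paper's benchmarking protocol: encoders are pre-trained with UL on expert demonstrations and then used, with weights frozen, inside RL agents. The sketch below illustrates only that freezing step; the module names, checkpoint path, and layer sizes are placeholders, not identifiers from the paper or its codebase.

```python
import torch
import torch.nn as nn

num_actions = 6  # placeholder action count

encoder = nn.Sequential(  # stands in for a UL-pretrained image encoder
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
    nn.Flatten(),
)
# encoder.load_state_dict(torch.load("atc_encoder.pt"))  # hypothetical checkpoint

# Freeze the pretrained weights so RL updates touch only the policy head.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

policy_head = nn.Linear(32 * 20 * 20, num_actions)
optimizer = torch.optim.Adam(policy_head.parameters(), lr=1e-4)

obs = torch.zeros(8, 3, 84, 84)       # dummy observation batch
with torch.no_grad():                 # no gradients reach the frozen encoder
    features = encoder(obs)
logits = policy_head(features)        # the RL loss (e.g. PPO) would use these
```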