Behavior From the Void: Unsupervised Active Pre-Training
Authors: Hao Liu, Pieter Abbeel
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate APT by exposing task-specific reward after a long unsupervised pre-training phase. In Atari games, APT achieves human-level performance on 12 games and obtains highly competitive performance compared to canonical fully supervised RL algorithms. On DMControl suite, APT beats all baselines in terms of asymptotic performance and data efficiency and dramatically improves performance on tasks that are extremely difficult to train from scratch. |
| Researcher Affiliation | Academia | Hao Liu, UC Berkeley, hao.liu@cs.berkeley.edu; Pieter Abbeel, UC Berkeley, pabbeel@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1: Training APT (a hedged sketch of the intrinsic objective it optimizes appears after the table) |
| Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the described methodology or provide a link to a code repository for APT. |
| Open Datasets | Yes | We test APT in DeepMind Control Suite [DMControl; 58] and the Atari suite [9]. |
| Dataset Splits | No | The paper mentions 'pre-training phase' and 'testing period' but does not explicitly provide details on how validation data was split or used for hyperparameter tuning. It uses a replay buffer for training but doesn't define a distinct validation set or its properties. |
| Hardware Specification | No | The paper mentions 'GPU-based data augmentations' but does not provide specific details on CPU, GPU models, memory, or any other hardware used for running the experiments. |
| Software Dependencies | No | Kornia [50] is used for efficient GPU-based data augmentations. Our model is implemented in NumPy [21] and PyTorch [45]. No version numbers are specified for these libraries. |
| Experiment Setup | Yes | For our DeepMind Control Suite and Atari games experiments, we largely follow DrQ, except we perform two gradient steps per environment step instead of one. Following DrQ, the representation encoder fθ(·) is implemented by a convolutional residual network followed by a fully-connected layer, a LayerNorm and a Tanh non-linearity. We decrease the output dimension of the fully-connected layer after the convnet from 50 to 15. We find it helps to use spectral normalization [39] to normalize the weights and use ELU [15] as the non-linearity in between convolutional layers. (A minimal encoder sketch follows below.) |
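For the Pseudocode row above: APT's pre-training objective is a particle-based estimate of state entropy in a learned representation space. The snippet below is a minimal, hypothetical PyTorch sketch of such a k-nearest-neighbour intrinsic reward; the neighbour count `k`, the constant `c`, and the exact distance aggregation are illustrative assumptions, not verified hyperparameters from the paper.

```python
import torch


def apt_intrinsic_reward(z, k=10, c=1.0):
    # z: (B, D) encoded states sampled from the replay buffer.
    # Particle-based entropy sketch: reward each state by the log of the
    # average distance to its k nearest neighbours in representation space,
    # so states in sparsely visited regions receive higher intrinsic reward.
    dists = torch.cdist(z, z)                          # (B, B) pairwise distances
    knn, _ = dists.topk(k + 1, dim=1, largest=False)   # includes the self-distance 0
    knn = knn[:, 1:]                                   # drop the self-distance
    return torch.log(c + knn.mean(dim=1))              # (B,) intrinsic reward
```

A natural usage is to compute the neighbours within each sampled replay minibatch, which keeps the reward computation batched on GPU and avoids maintaining a separate memory of visited states.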
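For the Experiment Setup row: below is a minimal PyTorch sketch of the described encoder (spectral-normalized convolutions with ELU activations, then a fully-connected layer down to 15 dimensions, LayerNorm and Tanh). The plain DrQ-style convolutional stack stands in for the residual network, and the 84x84 input size and channel counts are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class APTEncoder(nn.Module):
    # Sketch: spectral-normalized conv stack with ELU activations, then a
    # fully-connected layer, LayerNorm and Tanh, as listed in the setup row.
    # The exact conv layout (a residual network in the paper) is simplified.
    def __init__(self, in_channels=9, feature_dim=15, image_size=84):
        super().__init__()
        self.convnet = nn.Sequential(
            spectral_norm(nn.Conv2d(in_channels, 32, 3, stride=2)), nn.ELU(),
            spectral_norm(nn.Conv2d(32, 32, 3, stride=1)), nn.ELU(),
            spectral_norm(nn.Conv2d(32, 32, 3, stride=1)), nn.ELU(),
            spectral_norm(nn.Conv2d(32, 32, 3, stride=1)), nn.ELU(),
            nn.Flatten(),
        )
        # Infer the flattened conv output size from a dummy forward pass.
        with torch.no_grad():
            n_flat = self.convnet(
                torch.zeros(1, in_channels, image_size, image_size)
            ).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, feature_dim),  # output reduced from 50 to 15 in the paper
            nn.LayerNorm(feature_dim),
            nn.Tanh(),
        )

    def forward(self, obs):
        # obs: (B, C, H, W) pixel observations, assumed already normalized.
        return self.head(self.convnet(obs))
```

The low-dimensional, Tanh-bounded output is the representation space in which the nearest-neighbour reward sketched above would be computed.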