Representation Matters: Offline Pretraining for Sequential Decision Making
Authors: Mengjiao Yang, Ofir Nachum
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a variety of experiments utilizing standard offline RL datasets, we find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own. Extensive ablations further provide insights into what components of these unsupervised objectives (e.g., reward prediction, continuous or discrete representations, pretraining or finetuning) are most important and in which settings. |
| Researcher Affiliation | Industry | Google Research, Google Brain. Correspondence to: Mengjiao Yang <sherryy@google.com>. |
| Pseudocode | No | The paper describes various representation learning objectives but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/google-research/google-research/tree/master/rl_repr. |
| Open Datasets | Yes | We leverage the Gym-MuJoCo datasets from D4RL (Fu et al., 2020). |
| Dataset Splits | No | The paper describes using different datasets for pretraining and downstream tasks (e.g., D4RL medium/medium-replay for pretraining, D4RL expert for imitation learning) and evaluation frequency ('every 10k steps, we evaluate the learned policy'), but does not specify explicit train/validation/test dataset splits from a single dataset. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Unless otherwise noted, a single seed corresponds to an initial pretraining phase of 200k steps, in which a representation learning objective is optimized using batches of 256 sub-trajectories randomly sampled from the offline dataset. After pretraining, the learned representation is fixed and applied to the downstream task, which performs the appropriate training (BC, BRAC, or SAC) for 1M steps. ...we fix these to values which we found to generally perform best (regularization strength of 1.0 and policy learning rate of 0.00003). |
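
The Experiment Setup row above describes a two-phase protocol: 200k steps of representation pretraining on batches of 256 sub-trajectories sampled from the offline dataset, followed by 1M steps of downstream BC, BRAC, or SAC training on top of the frozen representation, with evaluation every 10k steps. The snippet below is a minimal sketch of that loop under assumed interfaces; `repr_learner`, `make_agent`, and `sample_subtrajectories` are hypothetical stand-ins for illustration and do not come from the released rl_repr code.

```python
# Minimal sketch of the single-seed protocol quoted in the Experiment Setup row.
# Assumed (hypothetical) interfaces: `repr_learner` exposes update() and
# frozen_encoder(); `make_agent` builds a BC/BRAC/SAC learner over the encoder.

import numpy as np

PRETRAIN_STEPS = 200_000       # representation pretraining phase
DOWNSTREAM_STEPS = 1_000_000   # downstream BC / BRAC / SAC training
BATCH_SIZE = 256               # sub-trajectories sampled per batch
POLICY_LR = 3e-5               # "policy learning rate of 0.00003"
REG_STRENGTH = 1.0             # regularization strength fixed to 1.0
EVAL_EVERY = 10_000            # policy evaluated every 10k steps


def sample_subtrajectories(dataset, batch_size, rng):
    """Hypothetical helper: draw `batch_size` random sub-trajectories from an
    offline dataset stored as a sequence (e.g., D4RL Gym-MuJoCo trajectories)."""
    idx = rng.integers(len(dataset), size=batch_size)
    return [dataset[i] for i in idx]


def run_single_seed(pretrain_dataset, downstream_dataset, repr_learner,
                    make_agent, seed=0):
    rng = np.random.default_rng(seed)

    # Phase 1: optimize the representation learning objective on the offline
    # dataset (e.g., D4RL medium or medium-replay).
    for _ in range(PRETRAIN_STEPS):
        batch = sample_subtrajectories(pretrain_dataset, BATCH_SIZE, rng)
        repr_learner.update(batch)

    # Phase 2: freeze the learned representation and train the downstream
    # agent (BC, BRAC, or SAC) on top of it for 1M steps.
    encoder = repr_learner.frozen_encoder()
    agent = make_agent(encoder=encoder, learning_rate=POLICY_LR,
                       regularization=REG_STRENGTH)
    eval_returns = []
    for step in range(DOWNSTREAM_STEPS):
        agent.train_step(downstream_dataset)
        if step % EVAL_EVERY == 0:
            eval_returns.append(agent.evaluate())
    return agent, eval_returns
```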