Representation Matters: Offline Pretraining for Sequential Decision Making
Authors: Mengjiao Yang, Ofir Nachum
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a variety of experiments utilizing standard offline RL datasets, we find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own. Extensive ablations further provide insights into what components of these unsupervised objectives (e.g., reward prediction, continuous or discrete representations, pretraining or finetuning) are most important and in which settings. |
| Researcher Affiliation | Industry | Google Research, Google Brain. Correspondence to: Mengjiao Yang <sherryy@google.com>. |
| Pseudocode | No | The paper describes various representation learning objectives but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/google-research/google-research/tree/master/rl_repr. |
| Open Datasets | Yes | We leverage the Gym-MuJoCo datasets from D4RL (Fu et al., 2020). |
| Dataset Splits | No | The paper describes using different datasets for pretraining and downstream tasks (e.g., D4RL medium/medium-replay for pretraining, D4RL expert for imitation learning) and evaluation frequency ('every 10k steps, we evaluate the learned policy'), but does not specify explicit train/validation/test dataset splits from a single dataset. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Unless otherwise noted, a single seed corresponds to an initial pretraining phase of 200k steps, in which a representation learning objective is optimized using batches of 256 sub-trajectories randomly sampled from the offline dataset. After pretraining, the learned representation is fixed and applied to the downstream task, which performs the appropriate training (BC, BRAC, or SAC) for 1M steps. ...we fix these to values which we found to generally perform best (regularization strength of 1.0 and policy learning rate of 0.00003). |
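
The Experiment Setup row above describes a two-phase protocol: 200k steps of representation pretraining on batches of 256 sub-trajectories sampled from the offline dataset, followed by 1M steps of downstream BC, BRAC, or SAC training on top of the frozen representation, with evaluation every 10k steps. The snippet below is a minimal sketch of that loop under assumed interfaces; `repr_learner`, `make_agent`, and `sample_subtrajectories` are hypothetical stand-ins for illustration and do not come from the released rl_repr code.

```python
# Minimal sketch of the single-seed protocol quoted in the Experiment Setup row.
# Assumed (hypothetical) interfaces: `repr_learner` exposes update() and
# frozen_encoder(); `make_agent` builds a BC/BRAC/SAC learner over the encoder.

import numpy as np

PRETRAIN_STEPS = 200_000       # representation pretraining phase
DOWNSTREAM_STEPS = 1_000_000   # downstream BC / BRAC / SAC training
BATCH_SIZE = 256               # sub-trajectories sampled per batch
POLICY_LR = 3e-5               # "policy learning rate of 0.00003"
REG_STRENGTH = 1.0             # regularization strength fixed to 1.0
EVAL_EVERY = 10_000            # policy evaluated every 10k steps


def sample_subtrajectories(dataset, batch_size, rng):
    """Hypothetical helper: draw `batch_size` random sub-trajectories from an
    offline dataset stored as a sequence (e.g., D4RL Gym-MuJoCo trajectories)."""
    idx = rng.integers(len(dataset), size=batch_size)
    return [dataset[i] for i in idx]


def run_single_seed(pretrain_dataset, downstream_dataset, repr_learner,
                    make_agent, seed=0):
    rng = np.random.default_rng(seed)

    # Phase 1: optimize the representation learning objective on the offline
    # dataset (e.g., D4RL medium or medium-replay).
    for _ in range(PRETRAIN_STEPS):
        batch = sample_subtrajectories(pretrain_dataset, BATCH_SIZE, rng)
        repr_learner.update(batch)

    # Phase 2: freeze the learned representation and train the downstream
    # agent (BC, BRAC, or SAC) on top of it for 1M steps.
    encoder = repr_learner.frozen_encoder()
    agent = make_agent(encoder=encoder, learning_rate=POLICY_LR,
                       regularization=REG_STRENGTH)
    eval_returns = []
    for step in range(DOWNSTREAM_STEPS):
        agent.train_step(downstream_dataset)
        if step % EVAL_EVERY == 0:
            eval_returns.append(agent.evaluate())
    return agent, eval_returns
```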