Improving Zero-Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions

Authors: Bogdan Mazoure, Ilya Kostrikov, Ofir Nachum, Jonathan J. Tompson

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of GSF and other baseline methods on both benchmarks, and show that GSF outperforms both previous state-of-the-art offline RL and representation-learning baselines on the entire distribution of levels.
Researcher Affiliation | Collaboration | Bogdan Mazoure (McGill University, Quebec AI Institute); Ilya Kostrikov (UC Berkeley); Ofir Nachum (Google Brain); Jonathan Tompson (Google Brain)
Pseudocode | Yes | Algorithm 1: LearnGVF(c, D_{µ,i}, θ^{(0)}, J, η, γ): offline estimation of the GVF Ĝ^µ_i (a hedged sketch of this procedure appears after the table).
Open Source Code | Yes | Code can be found at https://github.com/bmazoure/gsf_public.
Open Datasets | Yes | We devised two new benchmarks, offline Procgen (discrete actions) and offline Distracting Control Suite (continuous actions): two offline RL datasets that directly test for generalization of RL agents across observation functions (a data-logging sketch follows the table).
Dataset Splits | No | An important distinction from online RL is that, in the offline RL setting, we assume access to a historical dataset D_µ (instead of a simulator), collected by logging the experience of the policy µ in the form {o_{i,t}, a_{i,t}, r_{i,t}} for i = 1, ..., N and t = 1, ..., T, where, for practical purposes, each episode is truncated at T timesteps. Furthermore, we assume that the agent can only be trained on a limited collection of POMDPs M_train = {M_i}_{i=1}^m, and its performance is evaluated on the set of test POMDPs M_test (a split sketch follows the table).
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper does not provide specific version numbers for key software components or libraries used in the experiments.
Experiment Setup | No | The paper mentions "1 million gradient steps" for training on offline Procgen and "1M frames" for the Distracting Control Suite, but does not specify concrete hyperparameter values such as learning rates, batch sizes, or optimizer settings.
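
For the Pseudocode row: a minimal sketch of what Algorithm 1's LearnGVF could look like, assuming a linear function approximator fit by semi-gradient TD(0) on the logged dataset. The function and argument names, the linear parameterization, and the dataset layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def learn_gvf(cumulant, dataset, theta0, num_iters, step_size, gamma):
    """Hypothetical sketch of LearnGVF: estimate a generalized value function
    G^mu(o) ~ E[sum_t gamma^t c(o_t)] from logged data by TD(0) regression.

    dataset: list of episodes; each episode is a list of (phi, action, reward)
             tuples, where phi is a feature vector for observation o_t.
    cumulant: function c(phi) -> scalar signal to accumulate (assumption).
    """
    theta = theta0.copy()
    for _ in range(num_iters):                     # J passes over the data
        for episode in dataset:
            for t in range(len(episode) - 1):
                phi, _, _ = episode[t]             # features of o_t
                phi_next, _, _ = episode[t + 1]    # features of o_{t+1}
                # Semi-gradient TD(0): target = c(o_t) + gamma * G_hat(o_{t+1})
                target = cumulant(phi) + gamma * phi_next @ theta
                theta += step_size * (target - phi @ theta) * phi
    return theta
```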
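
For the Open Datasets row: a hedged sketch of how an offline dataset of logged tuples {o_{i,t}, a_{i,t}, r_{i,t}} could be collected on Procgen, assuming the public `procgen` package and the classic Gym step API. The random behavior policy, episode count, and file name are placeholders; the paper logs trajectories from a trained policy µ.

```python
import pickle
import gym  # classic Gym step API (obs, reward, done, info); assumption

# Train-time levels only; zero-shot evaluation uses unseen levels.
env = gym.make("procgen:procgen-coinrun-v0", num_levels=200, start_level=0)

episodes = []
for _ in range(10):  # N episodes; a stand-in for the paper's dataset size
    obs, done, traj = env.reset(), False, []
    while not done:
        action = env.action_space.sample()  # placeholder for mu(a | o)
        next_obs, reward, done, _ = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    episodes.append(traj)

with open("offline_procgen_coinrun.pkl", "wb") as f:  # hypothetical file name
    pickle.dump(episodes, f)
```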
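
For the Dataset Splits row: a minimal sketch of the train/test POMDP split the excerpt describes, i.e. training on a limited collection M_train of m levels and evaluating zero-shot on held-out M_test. The pool size and m are illustrative values; the paper's exact split is precisely what this row flags as unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
all_levels = rng.permutation(500)   # pool of POMDP (level) IDs; assumed size

m = 200                             # |M_train|; illustrative value
train_levels = all_levels[:m]       # M_train = {M_1, ..., M_m}
test_levels = all_levels[m:]        # M_test: never seen during training

# D_mu contains only transitions logged on train_levels; zero-shot
# generalization is measured by rolling out the learned policy on test_levels.
```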