Mind the Gap: Offline Policy Optimization for Imperfect Rewards

Authors: Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Through extensive experiments on D4RL datasets (Fu et al., 2020), sparse reward tasks, multi-task data sharing tasks and a discrete-space navigation task, we demonstrate that RGM can achieve superior performance across diverse settings of imperfect rewards.
Researcher Affiliation | Collaboration | Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; {li-jx21,hu-x21}@mails.tsinghua.edu.cn, zhanxianyuan@air.tsinghua.edu.cn. This work is supported by funding from Haomo.AI, and National Natural Science Foundation of China under Grant 62125304, 62073182.
Pseudocode | Yes | Algorithm 1: RGM (KL-divergence) with Deep Neural Networks.
Open Source Code | Yes | Code is available at https://github.com/Facebear-ljx/RGM.
Open Datasets | Yes | Through extensive experiments on D4RL datasets (Fu et al., 2020), sparse reward tasks, multi-task data sharing tasks and a discrete-space navigation task... D4RL (Fu et al., 2020) datasets... Robomimic (Mandlekar et al., 2021) Lift and Can tasks... Deep Mind Control Suite (Tassa et al., 2018). (A minimal D4RL loading sketch follows the table.)
Dataset Splits | No | The paper describes the composition of the various datasets (D4RL, Robomimic, Ant Maze) and how expert data is sampled, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) needed to reproduce the experimental setup.
Hardware Specification | Yes | We run RGM on one RTX 3080Ti GPU with about 1h30min training time to apply 1M gradient steps.
Software Dependencies | No | The paper mentions using Adam as the optimizer (citing Kingma & Ba, 2015), but it does not list its software dependencies, such as the deep learning framework used (e.g., PyTorch, TensorFlow) or any version numbers.
Experiment Setup | Yes | Table 3: The hyperparameters of RGM with deep neural networks. The table specifies the architecture (hidden dimension, number of layers, and activation function for each of the reward correction network, discriminator, value network Vθ, and policy) and the RGM hyperparameters (optimizer; learning rates and schedules for the reward correction network, discriminator, Vθ, and policy; Vθ gradient L2-regularization; discount factor; f-divergence; alpha values). (A placeholder configuration sketch follows the table.)
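
The Open Datasets row above refers to the standard D4RL benchmark. As a point of reference only, and not the RGM repository's own data loader, the snippet below shows the commonly used d4rl Python API for pulling an offline dataset; the task name is an arbitrary example.

    # Hedged example: loading a D4RL offline dataset through the standard d4rl API.
    # This is NOT the RGM repository's loader; the task name below is illustrative.
    import gym
    import d4rl  # importing d4rl registers its offline-RL environments with gym

    env = gym.make("halfcheetah-medium-v2")  # example D4RL task, not necessarily the paper's setting
    dataset = d4rl.qlearning_dataset(env)    # dict of observations, actions, rewards, next_observations, terminals

    print(dataset["observations"].shape, dataset["actions"].shape)
    print("reward range:", dataset["rewards"].min(), dataset["rewards"].max())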
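
The Experiment Setup row names the hyperparameter categories from the paper's Table 3 without reproducing their values. The sketch below is only an assumed way to organize those categories in PyTorch-style code: every numeric value is a placeholder rather than the paper's setting, the ReLU activation and network sizes are guesses, and the four components mirror the networks named in the row (reward correction, discriminator, value function Vθ, policy).

    # Hedged sketch only: one possible way to organize the Table 3 hyperparameter
    # categories. All values below are PLACEHOLDERS, not the paper's settings.
    from dataclasses import dataclass

    import torch
    import torch.nn as nn

    @dataclass
    class RGMConfig:
        # Architecture (hidden dim, layers) per network -- placeholders
        reward_correction_hidden: int = 256
        discriminator_hidden: int = 256
        value_hidden: int = 256
        policy_hidden: int = 256
        n_layers: int = 2
        # Optimization hyperparameters named in the row -- placeholders
        reward_correction_lr: float = 3e-4
        discriminator_lr: float = 3e-4
        value_lr: float = 3e-4
        policy_lr: float = 3e-4
        value_grad_l2: float = 1e-4   # Vθ gradient L2-regularization weight
        discount: float = 0.99        # discount factor
        f_divergence: str = "kl"      # the Pseudocode row mentions a KL-divergence variant
        alpha: float = 1.0

    def mlp(in_dim: int, hidden: int, out_dim: int, n_layers: int) -> nn.Sequential:
        """Build a simple ReLU MLP; the paper's actual activations are listed in its Table 3."""
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        return nn.Sequential(*layers)

    # Example: instantiating the four components named in the Experiment Setup row.
    cfg = RGMConfig()
    obs_dim, act_dim = 17, 6  # placeholder dimensions for a MuJoCo-style task
    reward_correction = mlp(obs_dim + act_dim, cfg.reward_correction_hidden, 1, cfg.n_layers)
    discriminator = mlp(obs_dim + act_dim, cfg.discriminator_hidden, 1, cfg.n_layers)
    value_fn = mlp(obs_dim, cfg.value_hidden, 1, cfg.n_layers)
    policy = mlp(obs_dim, cfg.policy_hidden, act_dim, cfg.n_layers)

    # Adam is the only optimizer the paper names (see the Software Dependencies row).
    optimizers = {
        "reward_correction": torch.optim.Adam(reward_correction.parameters(), lr=cfg.reward_correction_lr),
        "discriminator": torch.optim.Adam(discriminator.parameters(), lr=cfg.discriminator_lr),
        "value": torch.optim.Adam(value_fn.parameters(), lr=cfg.value_lr),
        "policy": torch.optim.Adam(policy.parameters(), lr=cfg.policy_lr),
    }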