Mind the Gap: Offline Policy Optimization for Imperfect Rewards
Authors: Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on D4RL datasets (Fu et al., 2020), sparse reward tasks, multi-task data sharing tasks and a discrete-space navigation task, we demonstrate that RGM can achieve superior performance across diverse settings of imperfect rewards. |
| Researcher Affiliation | Collaboration | Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China. {li-jx21,hu-x21}@mails.tsinghua.edu.cn, zhanxianyuan@air.tsinghua.edu.cn. This work is supported by funding from Haomo.AI, and the National Natural Science Foundation of China under Grants 62125304 and 62073182. |
| Pseudocode | Yes | Algorithm 1 RGM (KL-divergence) with Deep Neural Networks |
| Open Source Code | Yes | Code is available at https://github.com/Facebear-ljx/RGM. |
| Open Datasets | Yes | Through extensive experiments on D4RL datasets (Fu et al., 2020), sparse reward tasks, multi-task data sharing tasks and a discrete-space navigation task... D4RL (Fu et al., 2020) datasets... Robomimic (Mandlekar et al., 2021) Lift and Can tasks... DeepMind Control Suite (Tassa et al., 2018). (A loading sketch for the D4RL datasets follows the table.) |
| Dataset Splits | No | The paper describes the composition of various datasets (D4RL, Robomimic, Ant Maze) and how expert data is sampled, but it does not specify explicit train/validation/test splits (e.g., percentages or counts for each) for reproducing the experimental setup. |
| Hardware Specification | Yes | We run RGM on one RTX 3080Ti GPU with about 1h30min training time to apply 1M gradient steps. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and cites Kingma & Ba (2015), but it does not list software dependencies such as the deep learning framework used (e.g., PyTorch, TensorFlow) or any version numbers. |
| Experiment Setup | Yes | Table 3: The hyperparameters of RGM with deep neural networks. The table details the architecture (hidden dimension, number of layers, and activation function for the reward correction network, the discriminator, the value network Vθ, and the policy) and the RGM hyperparameters (optimizer; learning rates and schedules for the reward correction, discriminator, Vθ, and policy; Vθ gradient L2-regularization; discount factor; f-divergence; alpha values). (A hedged configuration sketch follows the table.) |
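
The Open Datasets row cites the D4RL benchmark (Fu et al., 2020). As a reference point for reproduction, below is a minimal loading sketch, assuming the standard `d4rl` package API; the environment name `hopper-medium-v2` is an illustrative choice and is not confirmed by this report to be one of RGM's evaluation tasks.

```python
# Minimal sketch (not from the paper): loading a D4RL offline dataset.
# "hopper-medium-v2" is an illustrative task name, not necessarily one
# used in the RGM experiments.
import gym
import d4rl  # registers the D4RL environments with gym on import

env = gym.make("hopper-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions,
                                       # next_observations, rewards, terminals

print(dataset["observations"].shape, dataset["actions"].shape)
```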
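
The Experiment Setup row lists the fields reported in the paper's Table 3 but not their values. The sketch below only organizes those fields into a configuration structure; every numeric value and string is a hypothetical placeholder, not the paper's setting, except that the Pseudocode row names the KL-divergence variant of Algorithm 1.

```python
# Hypothetical configuration skeleton mirroring the fields of the paper's
# Table 3. All values below are placeholders for illustration only; consult
# the paper or https://github.com/Facebear-ljx/RGM for the actual settings.
rgm_config = {
    "architecture": {
        "reward_correction": {"hidden_dim": 256, "layers": 2, "activation": "relu"},
        "discriminator":     {"hidden_dim": 256, "layers": 2, "activation": "tanh"},
        "value_network_V":   {"hidden_dim": 256, "layers": 2, "activation": "relu"},
        "policy":            {"hidden_dim": 256, "layers": 2, "activation": "relu"},
    },
    "training": {
        "optimizer": "Adam",
        "reward_correction_lr": 3e-4,   # placeholder
        "discriminator_lr": 3e-4,       # placeholder
        "value_lr": 3e-4,               # placeholder
        "policy_lr": 3e-4,              # placeholder
        "lr_schedule": "cosine",        # placeholder
        "value_grad_l2_reg": 1e-4,      # placeholder
        "discount": 0.99,               # placeholder
        "f_divergence": "KL",           # Algorithm 1 is stated for the KL case
        "alpha": 1.0,                   # placeholder
    },
}
```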