Boosting Offline Reinforcement Learning with Action Preference Query
Authors: Qisen Yang, Shenzhi Wang, Matthieu Gaetan Lin, Shiji Song, Gao Huang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we instantiate OAP with state-of-the-art offline RL algorithms and perform proof-of-concept investigations on the D4RL benchmark (Fu et al., 2020). Moreover, comprehensive experiments on the D4RL benchmark and state-of-the-art algorithms demonstrate that OAP yields higher scores (29% higher on average), especially on challenging Ant Maze tasks (98% higher). |
| Researcher Affiliation | Academia | Department of Automation, BNRist, Tsinghua University, Beijing, China; Department of Computer Science, BNRist, Tsinghua University, Beijing, China |
| Pseudocode | Yes | Algorithm 1 Offline-with-Action-Preferences (a schematic, non-verbatim sketch of such a loop appears below the table) |
| Open Source Code | No | The paper links to the `rlkit` repository (https://github.com/rail-berkeley/rlkit), a library used for pre-training, but it never states that the authors' own implementation of OAP or the code for the described methodology is open-sourced or otherwise available. |
| Open Datasets | Yes | We consider three different domains of tasks in D4RL (Fu et al., 2020) benchmark: Gym, Ant Maze, and Adroit. (A data-loading sketch appears below the table.) |
| Dataset Splits | No | The paper does not explicitly specify how training, validation, and test splits were defined or used in the experimental setup: it gives no percentages, sample counts, or reference to standard D4RL splits. |
| Hardware Specification | No | The Acknowledgments section mentions a "generous donation of computing resources by High-Flyer AI" but does not provide any specific details about the hardware used, such as CPU models, GPU models, or other relevant specifications. |
| Software Dependencies | No | Table 5 lists hyperparameters for optimizers (Adam) and activation functions (ReLU) with citations to their original papers, but it does not specify version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software components crucial for replication. |
| Experiment Setup | Yes | The hyperparameters of OAP instantiated on TD3+BC (Fujimoto & Gu, 2021) and IQL (Kostrikov et al., 2022) are presented in Table 5, including detailed settings such as a critic learning rate of 3e-4, a mini-batch size of 256, and a discount factor of 0.99. (A wiring sketch for these values appears below the table.) |
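
The Pseudocode row names Algorithm 1, Offline-with-Action-Preferences. The sketch below is a generic offline-RL loop augmented with action-preference queries, written only to illustrate the idea the name suggests; it is not a transcription of the paper's Algorithm 1, and `query_preference`, `policy`, and `offline_update` are hypothetical placeholders.

```python
import numpy as np

def query_preference(state, action_a, action_b):
    """Hypothetical annotator interface: returns the preferred action.

    In a preference-query setting, a human or scripted annotator compares
    two candidate actions at a state; here it is stubbed with a coin flip.
    """
    return action_a if np.random.rand() < 0.5 else action_b

def oap_style_loop(dataset, policy, offline_update, n_iters=10, n_queries=32):
    """Generic offline training loop with action-preference queries.

    `policy` maps a state to an action; `offline_update` consumes the
    dataset plus collected preferences and returns an updated policy.
    Both are placeholders, not the authors' Algorithm 1.
    """
    preferences = []
    for _ in range(n_iters):
        # Compare the dataset action against the current policy's action
        # at states sampled from the offline dataset.
        for _ in range(n_queries):
            idx = np.random.randint(len(dataset["observations"]))
            s = dataset["observations"][idx]
            a_data = dataset["actions"][idx]
            a_pi = policy(s)
            preferred = query_preference(s, a_data, a_pi)
            preferences.append((s, preferred))
        # Fold the preferred actions into the next offline update,
        # e.g. as an auxiliary behavior-cloning signal.
        policy = offline_update(policy, dataset, preferences)
    return policy
```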
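The Open Datasets row cites the D4RL benchmark with Gym, Ant Maze, and Adroit domains. Below is a minimal loading sketch, assuming the standard `d4rl` package; the exact dataset versions (the `-v2`/`-v1` suffixes) are assumptions, since the table does not quote them.

```python
# Minimal sketch of loading one D4RL dataset per domain named in the paper
# (Gym locomotion, Ant Maze, Adroit); assumes the `d4rl` package is installed.
import gym
import d4rl  # registers the D4RL environments with gym

for env_name in ["halfcheetah-medium-v2", "antmaze-umaze-v2", "pen-human-v1"]:
    env = gym.make(env_name)
    # Transitions formatted for Q-learning-style offline methods:
    # 'observations', 'actions', 'rewards', 'next_observations', 'terminals'.
    data = d4rl.qlearning_dataset(env)
    print(env_name, data["observations"].shape, data["actions"].shape)
```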
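The Experiment Setup row quotes three Table 5 values (critic learning rate 3e-4, mini-batch size 256, discount factor 0.99) together with Adam and ReLU. Below is a minimal sketch wiring those quoted values into a critic update, assuming PyTorch; the network sizes, state/action dimensions, and placeholder batch are hypothetical, chosen only to make the sketch runnable.

```python
import torch
import torch.nn as nn

# Values quoted from the paper's Table 5.
CRITIC_LR = 3e-4
BATCH_SIZE = 256
DISCOUNT = 0.99

state_dim, action_dim, hidden = 17, 6, 256  # assumed dimensions

# Q(s, a) critic with ReLU activations, as Table 5 specifies ReLU.
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 1),
)

# Adam optimizer at the quoted critic learning rate.
critic_opt = torch.optim.Adam(critic.parameters(), lr=CRITIC_LR)

# One TD-style update on a random placeholder mini-batch.
s = torch.randn(BATCH_SIZE, state_dim)
a = torch.randn(BATCH_SIZE, action_dim)
r = torch.randn(BATCH_SIZE, 1)
q_next = torch.zeros(BATCH_SIZE, 1)  # stand-in for a target network's value
target = r + DISCOUNT * q_next
loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), target)
critic_opt.zero_grad()
loss.backward()
critic_opt.step()
```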