Boosting Offline Reinforcement Learning with Action Preference Query

Authors: Qisen Yang, Shenzhi Wang, Matthieu Gaetan Lin, Shiji Song, Gao Huang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Moreover, comprehensive experiments on the D4RL benchmark and state-of-the-art algorithms demonstrate that OAP yields higher (29% on average) scores, especially on challenging Ant Maze tasks (98% higher)." "Empirically, we instantiate OAP with state-of-the-art offline RL algorithms and perform proof-of-concept investigations on the D4RL benchmark (Fu et al., 2020)." |
| Researcher Affiliation | Academia | 1 Department of Automation, BNRist, Tsinghua University, Beijing, China; 2 Department of Computer Science, BNRist, Tsinghua University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1: Offline-with-Action-Preferences |
| Open Source Code | No | The paper links to the `rlkit` repository (https://github.com/rail-berkeley/rlkit), a library used for pre-training, but does not state that the authors' OAP implementation or the code for their described methodology is publicly available. |
| Open Datasets | Yes | "We consider three different domains of tasks in D4RL (Fu et al., 2020) benchmark: Gym, Ant Maze, and Adroit." (A loading sketch follows the table.) |
| Dataset Splits | No | The paper does not specify how training, validation, and test splits were defined or used, e.g., split percentages, sample counts, or a reference to standard D4RL splits. |
| Hardware Specification | No | The Acknowledgments section mentions a "generous donation of computing resources by High-Flyer AI" but gives no hardware details such as CPU or GPU models. |
| Software Dependencies | No | Table 5 lists the optimizer (Adam) and activation function (ReLU) with citations to their original papers, but gives no version numbers for the programming language (e.g., Python) or libraries (e.g., PyTorch, TensorFlow) needed for replication. |
| Experiment Setup | Yes | The hyperparameters of OAP instantiated on TD3+BC (Fujimoto & Gu, 2021) and IQL (Kostrikov et al., 2022) are presented in Table 5, including a critic learning rate of 3e-4, a mini-batch size of 256, and a discount factor of 0.99. (A configuration sketch follows the table.) |
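The paper evaluates on three D4RL domains (Gym, Ant Maze, Adroit). The following is a minimal sketch, not the authors' code, showing how such datasets are typically loaded with the public `d4rl` package; the specific task identifiers below (`halfcheetah-medium-v2`, `antmaze-medium-play-v2`, `pen-human-v1`) are assumed standard D4RL names chosen for illustration, not ones quoted from the paper.

```python
# Sketch: loading one representative D4RL task per domain mentioned in the paper.
# Assumes the open-source `d4rl` package; task names are illustrative.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

TASKS = [
    "halfcheetah-medium-v2",   # Gym locomotion domain
    "antmaze-medium-play-v2",  # Ant Maze domain
    "pen-human-v1",            # Adroit domain
]

for task in TASKS:
    env = gym.make(task)
    # qlearning_dataset returns a dict of numpy arrays:
    # observations, actions, next_observations, rewards, terminals.
    dataset = d4rl.qlearning_dataset(env)
    print(task, dataset["observations"].shape, dataset["actions"].shape)
```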
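To make the reported experiment settings concrete, here is a minimal configuration sketch, not the authors' released configuration. Only the critic learning rate, mini-batch size, discount factor, optimizer, and activation come from the quoted Table 5 values; the dataclass name and any remaining fields would have to be filled in from Table 5 itself.

```python
# Sketch: hyperparameters quoted from Table 5 of the paper, collected in one place.
# Class and field names are hypothetical; only the values marked "Table 5" are from the source.
from dataclasses import dataclass

@dataclass
class OAPBaseConfig:
    base_algorithm: str          # "TD3+BC" or "IQL", the two base algorithms named in the paper
    critic_lr: float = 3e-4      # Table 5: critic learning rate
    batch_size: int = 256        # Table 5: mini-batch size
    discount: float = 0.99       # Table 5: discount factor
    optimizer: str = "Adam"      # Table 5 cites the Adam optimizer
    activation: str = "ReLU"     # Table 5 cites the ReLU activation

td3bc_cfg = OAPBaseConfig(base_algorithm="TD3+BC")
iql_cfg = OAPBaseConfig(base_algorithm="IQL")
```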