Doubly Mild Generalization for Offline Reinforcement Learning
Authors: Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, Xiangyang Ji
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. |
| Researcher Affiliation | Academia | Yixiu Mao1, Qi Wang1, Yun Qu1, Yuhang Jiang1, Xiangyang Ji1 1Department of Automation, Tsinghua University myx21@mails.tsinghua.edu.cn, xyji@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1 DMG — 1: Initialize πϕ, πϕ′, Qθ, Qθ′, and Vψ. 2: for each gradient step do 3: Update ψ by minimizing Eq. (15) 4: Update θ by minimizing Eq. (16) 5: Update ϕ by maximizing Eq. (14) 6: Update target networks: θ′ ← (1 − ξ)θ′ + ξθ, ϕ′ ← (1 − ξ)ϕ′ + ξϕ 7: end for |
| Open Source Code | Yes | Our code is available at https://github.com/maoyixiu/DMG. |
| Open Datasets | Yes | We evaluate the proposed approach on Gym-MuJoCo locomotion domains and challenging AntMaze domains in D4RL [16]. [16] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. |
| Dataset Splits | No | The paper describes how evaluation is performed (e.g., averaging returns over evaluation trajectories and random seeds) but does not explicitly provide percentages or counts for training, validation, and test dataset splits. |
| Hardware Specification | Yes | We test the runtime of DMG and other baselines on a GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions several algorithms and optimizers (e.g., TD3, IQL, XQL, SQL, Adam) and their specific parameters. However, it does not provide specific version numbers for underlying software dependencies like Python, PyTorch/TensorFlow, CUDA, or other libraries. |
| Experiment Setup | Yes | Table 5: Hyperparameters of DMG. Includes Optimizer, Critic learning rate, Actor learning rate, Batch size, Discount factor, Number of iterations, Target update rate, Number of Critics, Penalty coefficient, Expectile, Inverse temperature, and Architecture details. |
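The target-network update on line 6 of Algorithm 1 is standard Polyak averaging. Below is a minimal self-contained Python sketch of that rule; the scalar `theta` stands in for a full parameter vector, and the rate `xi = 0.005` is an illustrative value, not necessarily the one listed in the paper's Table 5.

```python
def soft_update(target, online, xi):
    """Polyak averaging, as in line 6 of Algorithm 1:
    target <- (1 - xi) * target + xi * online."""
    return (1.0 - xi) * target + xi * online

# Hypothetical scalar "parameters" illustrating the update schedule.
theta = 1.0          # online parameter (held fixed here for clarity)
theta_target = 0.0   # target parameter
xi = 0.005           # target update rate (illustrative choice)

for _ in range(1000):
    theta_target = soft_update(theta_target, theta, xi)

# The gap |theta - theta_target| shrinks by a factor (1 - xi) per step,
# so after k steps theta_target = 1 - (1 - xi)**k when starting from 0.
```

Because each step contracts the gap geometrically, the target network trails the online network smoothly, which is the usual rationale for this update in actor-critic methods such as TD3, on which DMG's setup builds.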