Doubly Mild Generalization for Offline Reinforcement Learning

Authors: Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, Xiangyang Ji

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks.
Researcher Affiliation Academia Yixiu Mao¹, Qi Wang¹, Yun Qu¹, Yuhang Jiang¹, Xiangyang Ji¹; ¹Department of Automation, Tsinghua University; myx21@mails.tsinghua.edu.cn, xyji@tsinghua.edu.cn
Pseudocode Yes Algorithm 1 DMG 1: Initialize π_ϕ, π_ϕ′, Q_θ, Q_θ′, and V_ψ. 2: for each gradient step do 3: Update ψ by minimizing Eq. (15) 4: Update θ by minimizing Eq. (16) 5: Update ϕ by maximizing Eq. (14) 6: Update target networks: θ′ ← (1 − ξ)θ′ + ξθ, ϕ′ ← (1 − ξ)ϕ′ + ξϕ 7: end for
Open Source Code Yes Our code is available at https://github.com/maoyixiu/DMG.
Open Datasets Yes We evaluate the proposed approach on Gym-MuJoCo locomotion domains and challenging AntMaze domains in D4RL [16]. [16] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits No The paper describes how evaluation is performed (e.g., averaging returns over evaluation trajectories and random seeds) but does not explicitly provide percentages or counts for training, validation, and test dataset splits.
Hardware Specification Yes We test the runtime of DMG and other baselines on a GeForce RTX 3090.
Software Dependencies No The paper mentions several algorithms and optimizers (e.g., TD3, IQL, XQL, SQL, Adam) and their specific parameters. However, it does not provide specific version numbers for underlying software dependencies like Python, PyTorch/TensorFlow, CUDA, or other libraries.
Experiment Setup Yes Table 5: Hyperparameters of DMG. Includes Optimizer, Critic learning rate, Actor learning rate, Batch size, Discount factor, Number of iterations, Target update rate, Number of Critics, Penalty coefficient, Expectile, Inverse temperature, and Architecture details.
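
Step 6 of Algorithm 1 in the Pseudocode row updates the target networks with rate ξ. The sketch below illustrates that step, assuming it is the standard soft (Polyak) update in which each target parameter moves a fraction ξ toward its online counterpart. The parameter containers, the function name `soft_update`, and the value ξ = 0.5 are hypothetical and chosen only to keep the example small; they are not taken from the paper or its released code.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def soft_update(target: Params, online: Params, xi: float) -> None:
    """In-place soft update: target <- (1 - xi) * target + xi * online."""
    if not 0.0 <= xi <= 1.0:
        raise ValueError("xi must lie in [0, 1]")
    for name, online_values in online.items():
        target_values = target[name]
        for i, weight in enumerate(online_values):
            target_values[i] = (1.0 - xi) * target_values[i] + xi * weight


# Toy critic parameters (theta) and their target copy (theta_target).
theta = {"w": [1.0, 2.0]}
theta_target = {"w": [0.0, 0.0]}

# xi = 0.5 is an illustrative value, not the paper's target update rate.
soft_update(theta_target, theta, xi=0.5)
print(theta_target)
```

The same function would apply unchanged to the policy parameters ϕ and their target copy. A small ξ makes the targets track the online networks slowly, which is the usual reason for using a soft update instead of a hard copy.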