Mutual Information Regularized Offline Reinforcement Learning
Authors: Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication. |
| Researcher Affiliation | Industry | Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan. Sea AI Lab. {yusufma555, bingykang}@gmail.com |
| Pseudocode | Yes | Algorithm 1 (Mutual Information Regularized Offline RL). Input: initialize Q network Q_ϕ, policy network π_θ, dataset D, hyperparameters α1 and α2. For t ∈ {1, ..., MAX_STEP}: (1) train the Q network by gradient descent on the objective J_Q(ϕ) in Eqn. 12, ϕ := ϕ − η_Q ∇_ϕ J_Q(ϕ); (2) improve the policy network by gradient ascent on the objective J_π(θ) in Eqn. 13, θ := θ + η_π ∇_θ E_{s∼D, a∼π_θ(a\|s)}[Q_ϕ(s, a)] + α2 ∇_θ I_MISA. Output: the well-trained π_θ. (A minimal JAX sketch of this update loop follows the table.) |
| Open Source Code | Yes | Our code is attached and will be released upon publication. |
| Open Datasets | Yes | The paper evaluates on the publicly released D4RL benchmark: "our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark", including the antmaze-v0 and gym-locomotion-v2 environments. |
| Dataset Splits | No | The paper uses specific datasets (e.g., the D4RL benchmark, antmaze-v0, gym-locomotion-v2) but does not provide explicit train/validation/test splits by percentage or sample count in the main text. It mentions "average the mean returns over 10 evaluation trajectories and 5 random seeds" and "evaluate the antmaze-v0 environments for 100 episodes instead", which implies a held-out evaluation protocol but gives no split details. (A short sketch of this protocol follows the table.) |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like JAX [7] and Flax [19], and base RL algorithms like SAC [17], but does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17' or 'Flax 0.6.9'). |
| Experiment Setup | Yes | We use ELU activation function [11] and SAC [17] as the base RL algorithm. Besides, we use a learning rate of 1 × 10⁻⁴ for both the policy network and Q-value network with a cosine learning rate scheduler. When approximating E_{π_θ(a\|s)}[e^{T_ψ(s,a)}], we use 50 Monte-Carlo samples. (A sketch of the schedule and the Monte-Carlo estimate follows the table.) |
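
As a companion to the Pseudocode row, here is a minimal JAX/optax sketch of the two per-step updates in Algorithm 1. The callables `q_loss_fn`, `sample_actions_fn`, `q_value_fn`, and `mi_estimate_fn` are illustrative stand-ins for J_Q (Eqn. 12), the policy sampler, the learned Q function, and the I_MISA estimate; this is a sketch of the update structure under those assumptions, not the authors' implementation.

```python
# Minimal sketch of the per-step updates in Algorithm 1 (MISA), assuming the
# objective J_Q, the policy sampler, and the I_MISA estimate are provided as
# callables; all names below are illustrative, not the authors' code.
import jax
import optax


def q_step(q_params, q_opt_state, q_optimizer, q_loss_fn, batch):
    """Gradient descent on J_Q(phi) (Eqn. 12): phi := phi - eta_Q * grad J_Q(phi)."""
    loss, grads = jax.value_and_grad(q_loss_fn)(q_params, batch)
    updates, q_opt_state = q_optimizer.update(grads, q_opt_state, q_params)
    return optax.apply_updates(q_params, updates), q_opt_state, loss


def policy_step(pi_params, pi_opt_state, pi_optimizer, sample_actions_fn,
                q_value_fn, mi_estimate_fn, batch, alpha2, rng):
    """Gradient ascent on E_{s~D, a~pi}[Q(s, a)] + alpha2 * I_MISA (Eqn. 13)."""

    def neg_objective(params):
        actions = sample_actions_fn(params, batch["observations"], rng)
        q_term = q_value_fn(batch["observations"], actions).mean()
        mi_term = mi_estimate_fn(params, batch, rng)
        # optax minimizes, so negate the ascent objective.
        return -(q_term + alpha2 * mi_term)

    loss, grads = jax.value_and_grad(neg_objective)(pi_params)
    updates, pi_opt_state = pi_optimizer.update(grads, pi_opt_state, pi_params)
    return optax.apply_updates(pi_params, updates), pi_opt_state, -loss
```

In a full training loop these two steps would be called once per gradient step for t = 1, ..., MAX_STEP, typically wrapped in `jax.jit`.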
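
The Experiment Setup row reports a cosine learning-rate schedule starting at 1 × 10⁻⁴ and 50 Monte-Carlo samples for approximating E_{π_θ(a|s)}[e^{T_ψ(s,a)}]. The snippet below sketches both pieces; `sample_action_fn`, `T_fn`, and the `decay_steps` value are assumptions, not taken from the paper.

```python
# Sketch of the two setup details quoted above; the policy sampler, the critic
# T_psi, and the total step count are illustrative assumptions.
import jax
import jax.numpy as jnp
import optax

# Cosine learning-rate schedule starting at 1e-4 (decay_steps is assumed).
lr_schedule = optax.cosine_decay_schedule(init_value=1e-4, decay_steps=1_000_000)
optimizer = optax.adam(learning_rate=lr_schedule)


def mc_exp_T(rng, sample_action_fn, T_fn, states, n_samples=50):
    """Monte-Carlo estimate of E_{a ~ pi(.|s)}[exp(T_psi(s, a))] per state,
    using the 50 samples reported in the paper."""
    rngs = jax.random.split(rng, n_samples)
    # One batch of actions per RNG key: shape [n_samples, batch, action_dim].
    actions = jax.vmap(lambda r: sample_action_fn(r, states))(rngs)
    # Evaluate T_psi(s, a) for each sampled action: shape [n_samples, batch].
    t_values = jax.vmap(lambda a: T_fn(states, a))(actions)
    return jnp.exp(t_values).mean(axis=0)  # average over the sample axis
```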
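
For the evaluation protocol quoted in the Dataset Splits row (mean returns over 10 evaluation trajectories and 5 random seeds, 100 episodes for antmaze-v0), a small gym-style sketch could look like the following; `make_env_fn` and `policy_fn` are hypothetical stand-ins.

```python
# Sketch of the reported evaluation protocol: average the mean return over
# n_episodes trajectories and n_seeds random seeds. make_env_fn / policy_fn
# are hypothetical; the environment follows the classic gym step API used by D4RL.
import numpy as np


def evaluate(policy_fn, make_env_fn, n_episodes=10, n_seeds=5):
    per_seed_means = []
    for seed in range(n_seeds):
        env = make_env_fn(seed)
        returns = []
        for _ in range(n_episodes):
            obs, done, episode_return = env.reset(), False, 0.0
            while not done:
                obs, reward, done, _ = env.step(policy_fn(obs))
                episode_return += reward
            returns.append(episode_return)
        per_seed_means.append(np.mean(returns))
    # antmaze-v0 would use n_episodes=100 per the paper.
    return float(np.mean(per_seed_means))
```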