Mutual Information Regularized Offline Reinforcement Learning

Authors: Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication."
Researcher Affiliation | Industry | "Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan; Sea AI Lab; {yusufma555, bingykang}@gmail.com"
Pseudocode | Yes | "Algorithm 1 Mutual Information Regularized Offline RL. Input: Initialize Q network Qϕ, policy network πθ, dataset D, hyperparameters α1 and α2. for t ∈ {1, . . . , MAX_STEP} do: Train the Q network by gradient descent with objective JQ(ϕ) in Eqn. 12: ϕ := ϕ − ηQ ∇ϕ JQ(ϕ). Improve the policy network by gradient ascent with objective Jπ(θ) in Eqn. 13: θ := θ + ηπ ∇θ Es∼D, a∼πθ(a|s)[Qϕ(s, a)] + α2 ∇θ IMISA. end for. Output: The well-trained πθ." (A minimal JAX sketch of this training loop is given after the table.)
Open Source Code | Yes | "Our code is attached and will be released upon publication."
Open Datasets | Yes | "Our code is implemented in JAX [7] with Flax [19]."
Dataset Splits | No | The paper refers to specific datasets (e.g., the D4RL benchmark, antmaze-v0, gym-locomotion-v2) but does not provide explicit train/validation/test splits (by percentages or sample counts) in the main text. It mentions 'average the mean returns over 10 evaluation trajectories and 5 random seeds' and 'evaluate the antmaze-v0 environments for 100 episodes instead', which describes the evaluation protocol but gives no split details.
Hardware Specification | Yes | "All experiments are conducted on NVIDIA 3090 GPUs."
Software Dependencies | No | The paper mentions software such as JAX [7] and Flax [19], and base RL algorithms such as SAC [17], but does not provide version numbers for these dependencies (e.g., 'JAX 0.3.17' or 'Flax 0.6.9').
Experiment Setup | Yes | "We use ELU activation function [11] and SAC [17] as the base RL algorithm. Besides, we use a learning rate of 1 × 10⁻⁴ for both the policy network and Q-value network with a cosine learning rate scheduler. When approximating Eπθ(a|s)[e^{Tψ(s,a)}], we use 50 Monte-Carlo samples." (A sketch of the Monte-Carlo estimate and the cosine schedule follows the table.)
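
The Algorithm 1 row alternates a gradient-descent step on the Q network with a gradient-ascent step on the policy. Below is a minimal JAX/Flax sketch of that loop under stated assumptions: `q_loss_fn` is a stand-in for JQ(ϕ) (Eqn. 12), `misa_regularizer` is a stand-in for the IMISA term, and the network sizes, state/action dimensions, deterministic policy, and Adam optimizer are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of the Algorithm 1 training loop in JAX/Flax (hypothetical;
# q_loss_fn and misa_regularizer are placeholders, not the paper's Eqn. 12/13).
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

class MLP(nn.Module):
    out_dim: int

    @nn.compact
    def __call__(self, x):
        for _ in range(2):
            x = nn.elu(nn.Dense(256)(x))  # ELU activations, as stated in the setup
        return nn.Dense(self.out_dim)(x)

obs_dim, act_dim = 17, 6                  # assumed dimensions, for illustration only
q_net, pi_net = MLP(out_dim=1), MLP(out_dim=act_dim)
key = jax.random.PRNGKey(0)
q_params = q_net.init(key, jnp.zeros((1, obs_dim + act_dim)))
pi_params = pi_net.init(key, jnp.zeros((1, obs_dim)))
q_opt, pi_opt = optax.adam(1e-4), optax.adam(1e-4)
q_opt_state, pi_opt_state = q_opt.init(q_params), pi_opt.init(pi_params)

def misa_regularizer(pi_params, batch):
    # Placeholder for the mutual-information term I_MISA (returns 0 here).
    return 0.0

def q_loss_fn(q_params, batch):
    # Placeholder for J_Q(phi) (Eqn. 12): plain regression to a precomputed target.
    sa = jnp.concatenate([batch["obs"], batch["act"]], axis=-1)
    q = q_net.apply(q_params, sa).squeeze(-1)
    return jnp.mean((q - batch["target_q"]) ** 2)

def pi_loss_fn(pi_params, q_params, batch, alpha2):
    # Gradient ascent on E[Q(s, pi(s))] + alpha2 * I_MISA, written as descent
    # on the negated objective (deterministic policy, for brevity).
    a = pi_net.apply(pi_params, batch["obs"])
    q = q_net.apply(q_params, jnp.concatenate([batch["obs"], a], axis=-1))
    return -(jnp.mean(q) + alpha2 * misa_regularizer(pi_params, batch))

@jax.jit
def train_step(q_params, q_opt_state, pi_params, pi_opt_state, batch, alpha2):
    # Q-network update: phi := phi - eta_Q * grad_phi J_Q(phi)
    q_grads = jax.grad(q_loss_fn)(q_params, batch)
    q_updates, q_opt_state = q_opt.update(q_grads, q_opt_state)
    q_params = optax.apply_updates(q_params, q_updates)
    # Policy update: theta := theta + eta_pi * grad_theta (E[Q] + alpha2 * I_MISA)
    pi_grads = jax.grad(pi_loss_fn)(pi_params, q_params, batch, alpha2)
    pi_updates, pi_opt_state = pi_opt.update(pi_grads, pi_opt_state)
    pi_params = optax.apply_updates(pi_params, pi_updates)
    return q_params, q_opt_state, pi_params, pi_opt_state
```

A faithful implementation would plug in the paper's actual objectives, a stochastic SAC-style policy, and target networks; this sketch only mirrors the update structure of Algorithm 1.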
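
The experiment-setup row reports a 50-sample Monte-Carlo approximation of Eπθ(a|s)[e^{Tψ(s,a)}] and a cosine-scheduled learning rate of 1 × 10⁻⁴. The following small sketch shows how such an estimate and schedule could be written with JAX and optax; `policy_sample_fn`, `t_fn`, the dummy stand-ins, and the decay horizon are assumptions, not the paper's code.

```python
# Hypothetical sketch: 50-sample Monte-Carlo estimate of E_{a~pi(.|s)}[exp(T_psi(s,a))]
# and a cosine learning-rate schedule at 1e-4 (the decay horizon is an assumption).
import jax
import jax.numpy as jnp
import optax

NUM_MC_SAMPLES = 50  # as reported in the experiment setup

def mc_exp_T(key, policy_sample_fn, t_fn, states):
    """Estimate E_{a ~ pi(.|s)}[exp(T(s, a))] for a batch of states."""
    keys = jax.random.split(key, NUM_MC_SAMPLES)
    def one_draw(k):
        actions = policy_sample_fn(k, states)    # (batch, act_dim)
        return jnp.exp(t_fn(states, actions))    # (batch,)
    return jax.vmap(one_draw)(keys).mean(axis=0)  # average over the 50 draws

# Cosine-decayed learning rate of 1e-4 for both networks (1M steps is assumed).
lr_schedule = optax.cosine_decay_schedule(init_value=1e-4, decay_steps=1_000_000)
optimizer = optax.adam(learning_rate=lr_schedule)

# Example usage with dummy stand-ins for the policy sampler and the T_psi network:
key = jax.random.PRNGKey(0)
states = jnp.zeros((8, 17))
dummy_policy = lambda k, s: jax.random.normal(k, (s.shape[0], 6))
dummy_t = lambda s, a: -0.5 * jnp.sum(a ** 2, axis=-1)
estimate = mc_exp_T(key, dummy_policy, dummy_t, states)  # shape (8,)
```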