Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Mutual Information Regularized Offline Reinforcement Learning
Authors: Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication. |
| Researcher Affiliation | Industry | Xiao Ma Bingyi Kang Zhongwen Xu Min Lin Shuicheng Yan Sea AI Lab EMAIL |
| Pseudocode | Yes | Algorithm 1 Mutual Information Regularized Offline RL. Input: Initialize Q network $Q_\phi$, policy network $\pi_\theta$, dataset $D$, hyperparameters $\alpha_1$ and $\alpha_2$. for $t \in \{1, \dots, \text{MAX\_STEP}\}$ do: Train the Q network by gradient descent with objective $J_Q(\phi)$ in Eqn. 12: $\phi := \phi - \eta_Q \nabla_\phi J_Q(\phi)$. Improve policy network by gradient ascent with objective $J_\pi(\theta)$ in Eqn. 13: $\theta := \theta + \eta_\pi \nabla_\theta \mathbb{E}_{s \sim D,\, a \sim \pi_\theta(a \mid s)}[Q_\phi(s, a)] + \alpha_2 \nabla_\theta I_{\text{MISA}}$. end for. Output: The well-trained $\pi_\theta$. |
| Open Source Code | Yes | Our code is attached and will be released upon publication. |
| Open Datasets | Yes | Our code is implemented in JAX [7] with Flax [19]. |
| Dataset Splits | No | The paper refers to using specific datasets (e.g., D4RL benchmark, antmaze-v0, gym-locomotion-v2) but does not provide explicit train/validation/test splits by percentages or sample counts in the main text. It mentions 'average the mean returns over 10 evaluation trajectories and 5 random seeds' and 'evaluate the antmaze-v0 environments for 100 episodes instead', which implies evaluation on a test set, but no split details. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like JAX [7] and Flax [19], and base RL algorithms like SAC [17], but does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17' or 'Flax 0.6.9'). |
| Experiment Setup | Yes | We use ELU activation function [11] and SAC [17] as the base RL algorithm. Besides, we use a learning rate of $1 \times 10^{-4}$ for both the policy network and Q-value network with a cosine learning rate scheduler. When approximating $\mathbb{E}_{\pi_\theta(a \mid s)}\left[e^{T_\psi(s,a)}\right]$, we use 50 Monte-Carlo samples. |
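
The Experiment Setup row names two concrete ingredients: a cosine learning rate schedule starting at 1e-4, and a 50-sample Monte-Carlo approximation of the expectation of exp(T(s, a)) under the policy. The sketch below illustrates both in plain Python. It is not the authors' JAX/Flax code. The schedule shape (no warmup, decay to zero) is an assumption, since the quoted text does not specify it, and `cosine_lr`, `mc_expectation_exp`, `toy_t`, and `toy_policy` are hypothetical names introduced only for illustration.

```python
import math
import random


def cosine_lr(step, max_steps, base_lr=1e-4, final_lr=0.0):
    """Cosine-annealed learning rate.

    Assumed form: no warmup, decays from base_lr to final_lr over max_steps.
    """
    progress = min(max(step / max_steps, 0.0), 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def mc_expectation_exp(t_fn, sample_action, state, num_samples=50, rng=None):
    """Monte-Carlo estimate of E_{a ~ pi(.|s)}[exp(T(s, a))].

    t_fn(state, action) stands in for the learned function T.
    sample_action(state, rng) stands in for sampling from the policy.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_samples):
        action = sample_action(state, rng)
        total += math.exp(t_fn(state, action))
    return total / num_samples


# Toy stand-ins (hypothetical): a 1-D Gaussian policy and a quadratic T.
def toy_policy(state, rng):
    return rng.gauss(state, 1.0)


def toy_t(state, action):
    return -0.5 * (action - state) ** 2


if __name__ == "__main__":
    print("lr at step 0:", cosine_lr(0, 1000))
    print("lr at step 500:", cosine_lr(500, 1000))
    print("MC estimate:", mc_expectation_exp(toy_t, toy_policy, state=0.0))
```

With only 50 samples the estimate is noisy, so in practice the exponential is often evaluated in the log domain (log-sum-exp) to avoid overflow when T takes large values; that variant is omitted here to keep the sketch short.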