The In-Sample Softmax for Offline Reinforcement Learning

Authors: Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, Martha White

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we investigate three primary questions. First, in the tabular setting, can our algorithm InAC converge to the policy found by an oracle method that exactly eliminates out-of-distribution (OOD) actions when bootstrapping? Second, on MuJoCo benchmarks, how does our algorithm compare with several baselines using offline datasets with different coverage? Third, how does InAC compare with other baselines when used for online fine-tuning after offline training? We refer readers to Appendix B for additional details and supplementary experiments. (A tabular sketch of the oracle backup appears after the table.)
Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab; 2 University of Alberta, Alberta Machine Intelligence Institute (Amii); 3 University of Oxford
Pseudocode | No | No explicit pseudocode or algorithm block was found.
Open Source Code | Yes | We release the code at github.com/hwang-ua/inac_pytorch.
Open Datasets | Yes | In continuous control tasks, we used the datasets provided by D4RL. (A D4RL loading sketch appears after the table.)
Dataset Splits | No | The paper uses D4RL datasets and other collected datasets but does not explicitly state train/validation/test splits or refer to predefined splits concretely enough for reproduction, nor does it mention a "validation" split.
Hardware Specification | No | No explicit hardware specifications (e.g., specific GPU/CPU models, memory amounts) were provided for running the experiments.
Software Dependencies | Yes | We use Python version 3.9.6, gym version 0.10.0, and PyTorch version 1.10.0. (A version-check sketch appears after the table.)
Experiment Setup | Yes | Network architecture: In MuJoCo tasks, we used 2 hidden layers with 256 nodes each for all neural networks. In discrete-action environments, we used 2 hidden layers with 64 nodes each. Offline training details: In all tasks, we used mini-batch sampling with a mini-batch size of 100. We used the Adam optimizer and the ReLU activation function. The target network is updated with a Polyak average: 0.995 × target weights + 0.005 × learned weights. We trained the agent for 0.8 million iterations in MuJoCo and 70k iterations in discrete-action environments. Algorithm parameter settings: MuJoCo tasks: For all algorithms, the learning rate was swept in {3e-4, 1e-4, 3e-5}. InAC swept τ in {1.0, 0.5, 0.33, 0.1, 0.01}. AWAC swept λ in {1.0, 0.5, 0.33, 0.1, 0.01}. IQL swept the expectile in {0.9, 0.7} and the temperature in {10.0, 3.0}, values taken from the original IQL paper. TD3+BC used α = 2.5 as in the original paper. CQL-SAC used automatic entropy tuning as in the original paper. Discrete-action environments: For all algorithms, the learning rate was swept in {0.003, 0.001, 0.0003, 0.0001, 3e-5, 1e-5}. For InAC, τ was swept in {1.0, 0.5, 0.1, 0.05, 0.01}. IQL used the same parameter sweep range as in MuJoCo tasks. AWAC used λ = 1.0 as in the original paper. CQL used α = 5.0. (An illustrative PyTorch sketch of this setup appears after the table.)
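
The tabular question in the Research Type row compares InAC against an oracle that removes out-of-distribution actions from the bootstrapping target. Below is a minimal sketch of such an oracle backup, assuming a tabular MDP with known reward table R, transition table P, and a dataset-derived in-sample action mask; all names here are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def in_sample_value_iteration(R, P, in_sample_mask, gamma=0.99, iters=1000):
    """Oracle-style backup that bootstraps only over in-sample actions.

    R: (S, A) reward table, P: (S, A, S) transition probabilities,
    in_sample_mask: (S, A) boolean, True where (s, a) appears in the dataset.
    """
    S, A = R.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        # Max over in-sample actions only; OOD actions are excluded from the target.
        v = np.where(in_sample_mask, q, -np.inf).max(axis=1)
        v = np.where(np.isfinite(v), v, 0.0)  # states with no in-sample action contribute 0
        q = R + gamma * (P @ v)
    return q
```

The oracle relies on a hard mask; InAC itself instead uses an in-sample softmax, so explicit elimination of OOD actions is not required.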
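
For the D4RL datasets referenced in the Open Datasets row, a minimal loading sketch follows; the task name "halfcheetah-medium-v2" is an assumption for illustration and is not specified in the row above.

```python
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the offline environments with gym

# Load one D4RL dataset as flat (s, a, r, s', done) arrays.
env = gym.make("halfcheetah-medium-v2")  # assumed task name, for illustration only
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
# keys: observations, actions, next_observations, rewards, terminals
```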
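
A quick sanity check against the dependency versions reported in the Software Dependencies row (a sketch, not part of the released code):

```python
import sys
import gym
import torch

# Reported versions: Python 3.9.6, gym 0.10.0, PyTorch 1.10.0
print("python:", sys.version.split()[0])
print("gym:", gym.__version__)
print("torch:", torch.__version__)
```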
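
The Experiment Setup row describes the MuJoCo network architecture, optimizer, and target-update scheme in enough detail for the following minimal PyTorch sketch; the state/action dimensions and the chosen learning rate are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # MuJoCo tasks: 2 hidden layers of 256 units with ReLU (64 units for discrete-action tasks).
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 17, 6  # assumed HalfCheetah dimensions, for illustration only
q_net = mlp(obs_dim + act_dim, 1)
q_target = mlp(obs_dim + act_dim, 1)
q_target.load_state_dict(q_net.state_dict())

# Adam optimizer; learning rate swept in {3e-4, 1e-4, 3e-5}, mini-batch size 100.
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
batch_size = 100

def polyak_update(target, online, tau=0.005):
    # target <- 0.995 * target + 0.005 * online, as stated in the setup.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.mul_(1.0 - tau).add_(p, alpha=tau)
```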