The In-Sample Softmax for Offline Reinforcement Learning

Authors: Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, Martha White

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we investigate three primary questions. First, in the tabular setting, can our algorithm InAC converge to the policy found by an oracle method that exactly eliminates out-of-distribution (OOD) actions when bootstrapping? Second, on MuJoCo benchmarks, how does our algorithm compare with several baselines using offline datasets with different coverage? Third, how does InAC compare with other baselines when used for online fine-tuning after offline training? We refer readers to Appendix B for additional details and supplementary experiments. (A tabular sketch of the oracle backup appears after the table.)
Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab; 2 University of Alberta, Alberta Machine Intelligence Institute (Amii); 3 University of Oxford
Pseudocode | No | No explicit pseudocode or algorithm block was found.
Open Source Code | Yes | We release the code at github.com/hwang-ua/inac_pytorch.
Open Datasets | Yes | In continuous control tasks, we used the datasets provided by D4RL. (A D4RL loading sketch appears after the table.)
Dataset Splits | No | The paper uses D4RL datasets and other collected datasets but does not explicitly state train/validation/test splits or refer to predefined splits concretely enough for reproduction, nor does it mention a "validation" split.
Hardware Specification | No | No explicit hardware specifications (e.g., specific GPU/CPU models, memory amounts) were provided for running the experiments.
Software Dependencies | Yes | We use Python version 3.9.6, gym version 0.10.0, and PyTorch version 1.10.0. (A version-check sketch appears after the table.)
Experiment Setup | Yes | Network architecture: In MuJoCo tasks, we used 2 hidden layers with 256 nodes each for all neural networks. In discrete-action environments, we used 2 hidden layers with 64 nodes each. Offline training details: In all tasks, we used mini-batch sampling with a mini-batch size of 100. We used the Adam optimizer and the ReLU activation function. The target network is updated with a Polyak average: 0.995 × target weights + 0.005 × learned weights. We trained the agent for 0.8 million iterations in MuJoCo and 70k iterations in discrete-action environments. Algorithm parameter settings: MuJoCo tasks: For all algorithms, the learning rate was swept in {3e-4, 1e-4, 3e-5}. InAC swept τ in {1.0, 0.5, 0.33, 0.1, 0.01}. AWAC swept λ in {1.0, 0.5, 0.33, 0.1, 0.01}. IQL swept the expectile in {0.9, 0.7} and the temperature in {10.0, 3.0}, values taken from the original IQL paper. TD3+BC used α = 2.5 as in the original paper. CQL-SAC used automatic entropy tuning as in the original paper. Discrete-action environments: For all algorithms, the learning rate was swept in {0.003, 0.001, 0.0003, 0.0001, 3e-5, 1e-5}. For InAC, τ was swept in {1.0, 0.5, 0.1, 0.05, 0.01}. IQL used the same parameter sweep range as in MuJoCo tasks. AWAC used λ = 1.0 as in the original paper. CQL used α = 5.0. (An illustrative PyTorch sketch of this setup appears after the table.)
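
The tabular question in the Research Type row compares InAC against an oracle that removes out-of-distribution actions from the bootstrapping target. Below is a minimal sketch of such an oracle backup, assuming a tabular MDP with known reward table R, transition table P, and a dataset-derived in-sample action mask; all names here are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def in_sample_value_iteration(R, P, in_sample_mask, gamma=0.99, iters=1000):
    """Oracle-style backup that bootstraps only over in-sample actions.

    R: (S, A) reward table, P: (S, A, S) transition probabilities,
    in_sample_mask: (S, A) boolean, True where (s, a) appears in the dataset.
    """
    S, A = R.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        # Max over in-sample actions only; OOD actions are excluded from the target.
        v = np.where(in_sample_mask, q, -np.inf).max(axis=1)
        v = np.where(np.isfinite(v), v, 0.0)  # states with no in-sample action contribute 0
        q = R + gamma * (P @ v)
    return q
```

The oracle relies on a hard mask; InAC itself instead uses an in-sample softmax, so explicit elimination of OOD actions is not required.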
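
For the D4RL datasets referenced in the Open Datasets row, a minimal loading sketch follows; the task name "halfcheetah-medium-v2" is an assumption for illustration and is not specified in the row above.

```python
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the offline environments with gym

# Load one D4RL dataset as flat (s, a, r, s', done) arrays.
env = gym.make("halfcheetah-medium-v2")  # assumed task name, for illustration only
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
# keys: observations, actions, next_observations, rewards, terminals
```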
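
A quick sanity check against the dependency versions reported in the Software Dependencies row (a sketch, not part of the released code):

```python
import sys
import gym
import torch

# Reported versions: Python 3.9.6, gym 0.10.0, PyTorch 1.10.0
print("python:", sys.version.split()[0])
print("gym:", gym.__version__)
print("torch:", torch.__version__)
```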
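
The Experiment Setup row describes the MuJoCo network architecture, optimizer, and target-update scheme in enough detail for the following minimal PyTorch sketch; the state/action dimensions and the chosen learning rate are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # MuJoCo tasks: 2 hidden layers of 256 units with ReLU (64 units for discrete-action tasks).
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 17, 6  # assumed HalfCheetah dimensions, for illustration only
q_net = mlp(obs_dim + act_dim, 1)
q_target = mlp(obs_dim + act_dim, 1)
q_target.load_state_dict(q_net.state_dict())

# Adam optimizer; learning rate swept in {3e-4, 1e-4, 3e-5}, mini-batch size 100.
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
batch_size = 100

def polyak_update(target, online, tau=0.005):
    # target <- 0.995 * target + 0.005 * online, as stated in the setup.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.mul_(1.0 - tau).add_(p, alpha=tau)
```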