The In-Sample Softmax for Offline Reinforcement Learning
Authors: Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, Martha White
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we investigate three primary questions. First, in the tabular setting, can our algorithm InAC converge to a policy found by an oracle method that exactly eliminates out-of-distribution (OOD) actions when bootstrapping (a minimal sketch of this masked backup appears after the table)? Second, on MuJoCo benchmarks, how does our algorithm compare with several baselines across offline datasets with different coverage? Third, how does InAC compare with other baselines when used for online fine-tuning after offline training? We refer readers to Appendix B for additional details and supplementary experiments. |
| Researcher Affiliation | Collaboration | 1) Huawei Noah's Ark Lab; 2) University of Alberta; Alberta Machine Intelligence Institute (Amii); 3) University of Oxford |
| Pseudocode | No | No explicit pseudocode or algorithm block was found. |
| Open Source Code | Yes | We release the code at github.com/hwang-ua/inac_pytorch. |
| Open Datasets | Yes | In continuous control tasks, we used the datasets provided by D4RL. |
| Dataset Splits | No | The paper uses D4RL datasets and other collected datasets, but it does not state train/validation/test splits, refer to predefined splits in a way that would allow reproduction, or mention a validation split at all. |
| Hardware Specification | No | No explicit hardware specifications (e.g., specific GPU/CPU models, memory amounts) were provided for running the experiments. |
| Software Dependencies | Yes | We use python version 3.9.6, gym version 0.10.0, pytorch version 1.10.0. |
| Experiment Setup | Yes | Network architecture: in MuJoCo tasks, all neural networks used 2 hidden layers with 256 nodes each; in discrete-action environments, 2 hidden layers with 64 nodes each. Offline training details: all tasks used minibatch sampling with a mini-batch size of 100, the Adam optimizer, and the ReLU activation function. The target network is updated with a Polyak average: 0.995 × target weights + 0.005 × learned weights. The agent was trained for 0.8 million iterations in MuJoCo and 70k iterations in discrete-action environments. Algorithm parameter settings. MuJoCo tasks: for all algorithms, the learning rate was swept over {3e-4, 1e-4, 3e-5}. InAC swept τ over {1.0, 0.5, 0.33, 0.1, 0.01}. AWAC swept λ over {1.0, 0.5, 0.33, 0.1, 0.01}. IQL swept the expectile over {0.9, 0.7} and the temperature over {10.0, 3.0}, following the values reported in the original IQL paper. TD3+BC used α = 2.5 as in the original paper. CQL-SAC used automatic entropy tuning as in the original paper. Discrete-action environments: for all algorithms, the learning rate was swept over {0.003, 0.001, 0.0003, 0.0001, 3e-5, 1e-5}. For InAC, τ was swept over {1.0, 0.5, 0.1, 0.05, 0.01}. IQL used the same sweep ranges as in the MuJoCo tasks. AWAC used λ = 1.0 as in the original paper. CQL used α = 5.0. |
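
The Research Type row mentions a tabular oracle that exactly eliminates OOD actions when bootstrapping. Below is a minimal sketch of such a masked backup, assuming a boolean in-sample mask derived from the offline dataset; the names `oracle_backup`, `q`, and `in_sample` are illustrative and not taken from the paper or its code.

```python
import numpy as np

def oracle_backup(q, s, a, r, s_next, gamma, alpha, in_sample):
    """One Q-learning backup for (s, a) where the bootstrap max is restricted
    to in-sample actions, i.e. actions observed at s_next in the offline dataset.

    q:          [num_states, num_actions] value table (float)
    in_sample:  boolean mask of the same shape; True where (state, action) appears in the data
    """
    # Exclude OOD actions from the max; assumes at least one in-sample action exists at s_next.
    masked = np.where(in_sample[s_next], q[s_next], -np.inf)
    q[s, a] += alpha * (r + gamma * np.max(masked) - q[s, a])
    return q
```

The network and optimization details in the Experiment Setup row translate into a short PyTorch sketch. This is illustrative only, with placeholder state/action dimensions for a continuous-control critic; the released code at github.com/hwang-ua/inac_pytorch is the authoritative implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # 2 hidden layers of 256 units with ReLU, as stated for the MuJoCo tasks.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

@torch.no_grad()
def polyak_update(target, online, tau=0.995):
    # target <- 0.995 * target + 0.005 * online, matching the reported update.
    for t, o in zip(target.parameters(), online.parameters()):
        t.mul_(tau).add_(o, alpha=1.0 - tau)

obs_dim, act_dim = 17, 6                           # placeholder MuJoCo-like dimensions
critic = mlp(obs_dim + act_dim, 1)                 # Q(s, a) head
critic_target = mlp(obs_dim + act_dim, 1)
critic_target.load_state_dict(critic.state_dict())
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)  # lr taken from the swept range
batch_size = 100
```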