Distributional Reinforcement Learning with Regularized Wasserstein Loss
Authors: Ke Sun, Yingnan Zhao, Wulong Liu, Bei Jiang, Linglong Kong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that Sinkhorn DRL consistently outperforms or matches existing algorithms on the Atari games suite and particularly stands out in the multi-dimensional reward setting. |
| Researcher Affiliation | Collaboration | 1 University of Alberta, Edmonton, Canada; 2 Harbin Engineering University, China; 3 Huawei Noah's Ark Lab |
| Pseudocode | Yes | Algorithm 1 Generic Sinkhorn distributional RL Update; Algorithm 2 Sinkhorn Iterations to Approximate Wc,ε; Algorithm 3 Sinkhorn Distributional RL |
| Open Source Code | Yes | Code is available in https://github.com/datake/Sinkhorn_Dist_RL. |
| Open Datasets | Yes | We substantiate the effectiveness of Sinkhorn DRL as described in Algorithm 1 on the entire 55 Atari 2600 games. |
| Dataset Splits | No | The paper states that algorithms are evaluated over 40M training frames and results are averaged over three seeds, but it does not specify explicit train/validation/test dataset splits. |
| Hardware Specification | Yes | We run our experiments on multiple NVIDIA 3090 Ti GPUs |
| Software Dependencies | No | The paper mentions building algorithms based on a 'well-accepted PyTorch implementation' and re-implementing MMD-DQN based on its 'original TensorFlow implementation', but it does not specify version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | For a fair comparison with QR-DQN, C51, and MMD-DQN, we use the same hyperparameters: the number of generated samples N = 200, Adam optimizer with lr = 0.00005, ϵ_Adam = 0.01/32. In Sinkhorn DRL, we choose the number of Sinkhorn iterations L = 10 and smoothing hyperparameter ε = 10.0 in Section 5.1 after conducting sensitivity analysis in Section 5.2. Guided by the contraction guarantee analyzed in Theorem 1, we use the unrectified kernel, specifically setting c = k_α and choosing α = 2. We evaluate all algorithms on 55 Atari games, averaging results over three seeds. |
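
The Pseudocode row above lists Algorithm 2, "Sinkhorn Iterations to Approximate W_{c,ε}". The sketch below illustrates that core fixed-point iteration for the entropic-regularized transport cost between two equally weighted sample sets, using the settings reported in the Experiment Setup row (N = 200 samples, L = 10 iterations, ε = 10.0, cost |x − y|^α with α = 2). It is a minimal sketch, not the authors' released implementation, which may differ (e.g., log-domain stabilization or debiasing terms); the name `sinkhorn_loss` is ours.

```python
import torch

def sinkhorn_loss(x, y, epsilon=10.0, n_iters=10, alpha=2.0):
    """Approximate the entropic-regularized OT cost between two 1-D empirical
    distributions given as equally weighted samples x and y of shape (N,)."""
    n = x.shape[0]
    # Pairwise cost matrix C_ij = |x_i - y_j|^alpha (unrectified-kernel cost)
    C = (x.unsqueeze(1) - y.unsqueeze(0)).abs().pow(alpha)
    # Uniform marginals over the N atoms of each empirical distribution
    a = torch.full((n,), 1.0 / n, dtype=x.dtype, device=x.device)
    b = torch.full((n,), 1.0 / n, dtype=x.dtype, device=x.device)
    # Gibbs kernel K = exp(-C / epsilon)
    K = torch.exp(-C / epsilon)
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    # Sinkhorn fixed-point iterations on the scaling vectors (u, v)
    for _ in range(n_iters):
        u = a / (K @ v + 1e-12)
        v = b / (K.t() @ u + 1e-12)
    # Transport plan P = diag(u) K diag(v); return <P, C>
    P = u.unsqueeze(1) * K * v.unsqueeze(0)
    return (P * C).sum()
```

In a distributional TD update, x would be the N return samples generated for the current state-action pair and y the (detached) samples forming the Bellman target.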
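The Experiment Setup row also pins down the optimizer and Sinkhorn hyperparameters shared with the QR-DQN, C51, and MMD-DQN baselines. The snippet below simply collects them in one place; the constant names and the `make_optimizer` helper are illustrative, not taken from the released code.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row; names are illustrative.
N_SAMPLES = 200          # generated return samples per state-action pair
SINKHORN_ITERS = 10      # L
SINKHORN_EPSILON = 10.0  # smoothing hyperparameter epsilon
COST_ALPHA = 2.0         # cost c = k_alpha with alpha = 2

LR = 5e-5                # Adam learning rate (0.00005)
ADAM_EPS = 0.01 / 32     # Adam epsilon

def make_optimizer(network: torch.nn.Module) -> torch.optim.Adam:
    # Same Adam settings used across the compared distributional RL agents.
    return torch.optim.Adam(network.parameters(), lr=LR, eps=ADAM_EPS)
```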