Regularized Q-learning through Robust Averaging
Authors: Peter Schmitt-Förster, Tobias Sutter
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods. |
| Researcher Affiliation | Academia | Department of Computer and Information Science, University of Konstanz, Germany. Correspondence to: Peter Schmitt-Förster <peter.schmitt-foerster@uni-konstanz.de>. |
| Pseudocode | No | The paper describes the update rules mathematically (e.g., equation (7)) but does not include a formal pseudocode or algorithm block. (A hedged sketch of a generic tabular update loop is given below the table.) |
| Open Source Code | Yes | Here: github.com/2RAQ/code |
| Open Datasets | Yes | Lastly, we conduct numerical experiments for various settings... In more practical experiments from the OpenAI Gym suite (Brockman et al., 2016) we show that, even when implementations require deviations from our theoretically required assumptions, 2RA Q-learning has good performance and mostly outperforms other Q-learning variants. |
| Dataset Splits | No | The paper mentions training episodes and evaluation, but does not provide explicit training, validation, or test dataset splits. For reinforcement learning environments, data is typically generated through interaction rather than being split from a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like "Tensorflow (Abadi et al., 2015)", "Huber loss (Huber, 1964)", and "Adam optimizer (Kingma & Ba, 2015)", but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | All methods use an initial learning rate of α0 = 0.01, wα = 10^5, and γ = 0.8. All 2RA agents additionally use wρ = 10^3. The reward function has values sampled uniformly at random from [−0.05, 0.05]. (A hedged sketch of these settings appears after the table.) |
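Since the update rule of equation (7) is not reproduced in this report, the snippet below is only a minimal sketch of a generic tabular Q-learning loop indicating where a robust-averaged bootstrap target would plug in. The `env.sample(state, action)` interface, the uniform exploration policy, and the plain sample average used as the target are illustrative assumptions, not the paper's 2RA estimator.

```python
import numpy as np

def tabular_q_learning_sketch(env, n_states, n_actions, n_steps=10_000,
                              gamma=0.8, alpha=0.01, m=5, seed=0):
    """Generic tabular Q-learning loop; the averaged target below is a
    placeholder, not the robust-averaging estimator of equation (7)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    state = rng.integers(n_states)
    for _ in range(n_steps):
        action = rng.integers(n_actions)  # uniform exploration (assumption)
        # Draw m next-state/reward samples for the same (state, action) pair;
        # env.sample is a hypothetical generative-model interface.
        samples = [env.sample(state, action) for _ in range(m)]
        # Plain sample average of the bootstrapped values; 2RA Q-learning
        # would replace this with its robust average.
        target = np.mean([r + gamma * Q[s_next].max() for s_next, r in samples])
        Q[state, action] += alpha * (target - Q[state, action])
        state = samples[-1][0]  # continue from the last sampled next state
    return Q
```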
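The decay weights wα and wρ quoted in the setup row suggest schedules of the common polynomial form α_t = α0·wα/(wα + t), but the paper's exact schedule and the precise role of wρ are not quoted here; the following is only an assumed illustration.

```python
ALPHA_0 = 0.01   # initial learning rate (from the quoted setup)
W_ALPHA = 1e5    # learning-rate decay weight
W_RHO = 1e3      # decay weight used by the 2RA agents
GAMMA = 0.8      # discount factor

def alpha(t: int) -> float:
    """Assumed polynomially decaying learning rate: alpha_t = alpha0 * w / (w + t)."""
    return ALPHA_0 * W_ALPHA / (W_ALPHA + t)

def rho_weight(t: int, rho_0: float = 1.0) -> float:
    """Hypothetical analogous decay for the 2RA weight; rho_0 is an invented
    placeholder, not a value reported in the paper."""
    return rho_0 * W_RHO / (W_RHO + t)
```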