Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Regularized Q-learning through Robust Averaging
Authors: Peter Schmitt-Förster, Tobias Sutter
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods. |
| Researcher Affiliation | Academia | 1Department of Computer and Information Science, University of Konstanz, Germany. Correspondence to: Peter Schmitt-Förster <EMAIL>. |
| Pseudocode | No | The paper describes the update rules mathematically (e.g., equation (7)) but does not include a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Here: github.com/2RAQ/code |
| Open Datasets | Yes | Lastly, we conduct numerical experiments for various settings... In more practical experiments from the OpenAI Gym suite (Brockman et al., 2016) we show that, even when implementations require deviations from our theoretically required assumptions, 2RA Q-learning has good performance and mostly outperforms other Q-learning variants. |
| Dataset Splits | No | The paper mentions training episodes and evaluation, but does not provide explicit training, validation, or test dataset splits. For reinforcement learning environments, data is typically generated through interaction rather than being split from a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like "Tensorflow (Abadi et al., 2015)", "Huber loss (Huber, 1964)", and "Adam optimizer (Kingma & Ba, 2015)", but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | All methods use an initial learning rate of α0 = 0.01, wα = 10^5, and γ = 0.8. All 2RA agents additionally use wρ = 10^3. The reward function has values sampled uniformly at random from [−0.05, 0.05]. |
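For readers checking reproducibility, the reported hyperparameters can be collected into a single configuration sketch. This is a minimal illustration only: the variable names (`CONFIG`, `sample_reward`) are hypothetical, and only the numeric values come from the paper's stated experiment setup.

```python
import random

# Hyperparameters as quoted from the paper's experiment setup.
# Key names are hypothetical stand-ins for the paper's symbols
# (alpha_0, w_alpha, gamma, w_rho).
CONFIG = {
    "alpha_0": 0.01,   # initial learning rate α0
    "w_alpha": 10**5,  # learning-rate parameter wα
    "gamma": 0.8,      # discount factor γ
    "w_rho": 10**3,    # additional 2RA parameter wρ
}

def sample_reward() -> float:
    """Reward values sampled uniformly at random from [-0.05, 0.05]."""
    return random.uniform(-0.05, 0.05)

# Sanity check: sampled rewards stay inside the stated interval.
assert all(-0.05 <= sample_reward() <= 0.05 for _ in range(1000))
```

Since the paper does not publish seeds or software versions, exact replication of the reported numbers is not guaranteed even with these settings.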