Non-Crossing Quantile Regression for Distributional Reinforcement Learning
Authors: Fan Zhou, Jianing Wang, Xingdong Feng
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Atari 2600 games show that some state-of-the-art DRL algorithms with the non-crossing modification can significantly outperform their baselines in terms of faster convergence and better testing performance. |
| Researcher Affiliation | Academia | Fan Zhou, Jianing Wang, Xingdong Feng; School of Statistics and Management, Shanghai University of Finance and Economics; zhoufan@mail.shufe.edu.cn; jianing.wang@163.sufe.edu.cn; feng.xingdong@mail.shufe.edu.cn |
| Pseudocode | No | No explicitly labeled pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code or a link to a code repository. |
| Open Datasets | Yes | We test our method on the full Atari-57 benchmark. We follow all the parameter settings of [5] and initialize the learning rate to be 5 × 10⁻⁵ at the training stage. |
| Dataset Splits | No | The paper mentions '200 million training frames' and 'testing scores' but does not specify the exact percentages or counts for training, validation, or test splits. It implicitly relies on the standard setup for Atari-57 without detailing it. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running experiments were mentioned. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We set the number of quantiles N to be 200 and evaluate both algorithms on 200 million training frames. We follow all the parameter settings of [5] and initialize the learning rate to be 5 × 10⁻⁵ at the training stage. For the exploration set-up, we set the bonus rate c_t in (25) to be 50√(log t / t), which decays with the training step t. For both algorithms, we set κ = 1 for the Huber quantile loss in (22) due to its smoothness. (Hedged code sketches of this loss, the bonus schedule, and the non-crossing constraint follow the table.) |
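For concreteness, here is a minimal sketch, in PyTorch (which the paper reports using, though no versions or official code are released), of the two quantities quoted in the Experiment Setup row: the Huber quantile loss with κ = 1 and the decaying bonus rate c_t = 50√(log t / t). All function names, tensor shapes, and the `scale` parameter are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: Huber quantile loss (kappa = 1) and the decaying
# exploration bonus rate quoted in the Experiment Setup row. Names and
# shapes are assumptions; the authors' code is not publicly available.
import math
import torch

def huber_quantile_loss(pred, target, taus, kappa=1.0):
    # pred, target: (batch, N) quantile estimates; taus: (N,) fractions in (0, 1).
    u = target.unsqueeze(1) - pred.unsqueeze(2)  # pairwise TD errors, (batch, N, N)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric quantile weighting |tau - 1{u < 0}| along the prediction axis.
    weight = (taus.view(1, -1, 1) - (u.detach() < 0).float()).abs()
    return (weight * huber / kappa).sum(dim=1).mean()

def bonus_rate(t, scale=50.0):
    # c_t = 50 * sqrt(log t / t); decays with the training step t (t >= 1).
    return scale * math.sqrt(math.log(t) / t)
```

With κ = 1 the loss is quadratic for TD errors below 1 and linear beyond, which is the smoothness property the setup row cites as the reason for this choice.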
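The paper's titular non-crossing constraint can be illustrated with one standard construction for monotone quantile estimates: predict softmax-normalized increments and take their cumulative sum, so the quantile curve is increasing by design. The sketch below assumes a PyTorch-style head with hypothetical layer names and a softplus-positive scale; it shows the general technique, not the authors' exact architecture.

```python
# Hypothetical sketch: a quantile head whose outputs cannot cross, built from
# a cumulative sum of softmax increments. Layer names and the scale/offset
# parameterization are assumptions, not the authors' released architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneQuantileHead(nn.Module):
    def __init__(self, feature_dim, n_actions, n_quantiles=200):
        super().__init__()
        self.n_actions, self.n_quantiles = n_actions, n_quantiles
        self.logits = nn.Linear(feature_dim, n_actions * n_quantiles)  # increment logits
        self.scale = nn.Linear(feature_dim, n_actions)                 # per-action range
        self.offset = nn.Linear(feature_dim, n_actions)                # per-action shift

    def forward(self, features):
        b = features.size(0)
        logits = self.logits(features).view(b, self.n_actions, self.n_quantiles)
        # Softmax increments are strictly positive, so their cumulative sum is
        # strictly increasing along the quantile axis: estimates never cross.
        cum = torch.softmax(logits, dim=-1).cumsum(dim=-1)
        alpha = F.softplus(self.scale(features)).unsqueeze(-1)
        beta = self.offset(features).unsqueeze(-1)
        return alpha * cum + beta  # (batch, n_actions, n_quantiles), monotone in last dim
```

Because monotonicity is enforced architecturally rather than through a penalty term, the Huber quantile loss sketched above can be used unchanged on these outputs.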