Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning
Authors: Tyler Kastner, Murat A. Erdogdu, Amir-massoud Farahmand
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our framework empirically in both tabular and continuous domains (Section 7). From Section 7 (Empirical evaluation): We now empirically study our framework and examine the phenomena discussed in the previous sections. We focus on two sets of experiments: the first is in tabular settings, where we use dynamic programming methods to perform an analysis without the noise of gradient-based learning. The second builds upon Lim & Malik (2022), where we augment their model-free algorithm with our framework and evaluate it on an option trading environment. We discuss training and environment details in Appendix E, and provide additional results in Appendix A.3. |
| Researcher Affiliation | Academia | Tyler Kastner University of Toronto, Vector Institute Murat A. Erdogdu University of Toronto, Vector Institute Amir-massoud Farahmand Vector Institute, University of Toronto |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide the code used to run our experiments at github.com/tylerkastner/distribution-equivalence. |
| Open Datasets | Yes | We adapt the stochastic four rooms domain used in Grimm et al. (2021)... We consider the stochastic adaptation of the cliff walk environment (Sutton & Barto, 2018) as introduced in Bellemare et al. (2023)... We use the 8 by 8 frozen lake domain as specified in Brockman et al. (2016). |
| Dataset Splits | No | The paper mentions training and evaluation but does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) needed to reproduce the data partitioning. For example, it does not state what percentage of data was used for training versus validation. |
| Hardware Specification | Yes | For the tabular experiments, each model took roughly 1 hour to train on a single CPU... For the option trading experiments, training a policy for a given CVaR level took roughly 40 minutes on a single Tesla P100 GPU on average... |
| Software Dependencies | No | The paper mentions using specific algorithms and methods (e.g., QR-DQN, C51) and refers to code being provided on GitHub, but it does not specify software dependencies like programming language versions or library versions (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | No | The paper states it discusses training details in Appendix E, but Appendix E primarily describes the environments and compute infrastructure. It does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, number of epochs) or optimizer settings. |