Risk-Aware Reinforcement Learning with Coherent Risk Measures and Non-linear Function Approximation

Authors: Thanh Lam, Arun Verma, Bryan Kian Hsiang Low, Patrick Jaillet

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we validate our theoretical results via empirical experiments on synthetic and real-world data.
Researcher Affiliation | Academia | Department of Computer Science, National University of Singapore, Republic of Singapore; Department of Electrical Engineering and Computer Science, MIT, USA; {chithanh, arun, lowkh}@comp.nus.edu.sg; jaillet@mit.edu
Pseudocode | Yes | RA-UCB (Risk-Aware Upper Confidence Bound) 1: Input: Hyperparameters of coherent risk measure ρ (e.g., confidence level α ∈ (0, 1) for CVaR) 2: for episode t = 1, 2, ..., T do 3: Receive the initial state x_1^t and initialize V_{H+1}^t as the zero function. 4: for step h = H, ..., 1 do 5: For τ ∈ [t − 1], draw m samples from the weak simulator and construct the response vector y_h^t using Eq. (7). 6: Compute µ_h^t and σ_h^t using Eq. (8). 7: Compute Q_h^t and V_h^t using Eq. (9). 8: end for 9: for step h = 1, ..., H do 10: Take action a_h^t ← arg max_{a ∈ A} Q_h^t(x_h^t, a). 11: Observe reward r_h(x_h^t, a_h^t) and the next state x_{h+1}^t. 12: end for 13: end for
Open Source Code | Yes | The code for these experiments is available in the supplementary material.
Open Datasets | No | The paper mentions "synthetic and real-world data" and states that the trading environment is "based on real historical exchange rates and volumes between EUR and USD" and customized from "ForexEnv in the python package gym-anytrading". It does not provide direct access links, DOIs, or citations to specific public datasets used for training.
Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not mention any specific hardware specifications (e.g., GPU/CPU models, memory, or cloud resources) used for running the experiments.
Software Dependencies | No | The paper mentions using "the RBF kernel and the Kernel Ridge regressor from Scikit-learn" and customizing an environment based on "the python package gym-anytrading" but does not specify version numbers for these software components.
Experiment Setup | Yes | We set the horizon of each episode to H = 30. ... In this experiment, we use m = 100 samples from the weak simulator to estimate the risk in Eq. (7). ... The robot does not know perturbation parameters (r = 0.3) and the obstacles positions, so it has to learn them online via interacting with the environment.
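
The pseudocode and experiment setup above refer to estimating a coherent risk measure such as CVaR from m samples drawn from the weak simulator. Eq. (7) itself is not reproduced on this page, so the following is only a minimal sketch of a generic empirical CVaR estimator, not the paper's estimator. It assumes the lower-tail convention for rewards (the mean of the worst ceil(α·m) samples); the function name `empirical_cvar` and the Gaussian stand-in for simulator samples are hypothetical.

```python
import math
import random


def empirical_cvar(samples, alpha):
    """Mean of the worst ceil(alpha * m) of m reward samples.

    Assumption: lower-tail convention, where smaller rewards are worse.
    This is a generic estimator, not Eq. (7) of the paper.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not samples:
        raise ValueError("need at least one sample")
    k = max(1, math.ceil(alpha * len(samples)))  # size of the tail
    worst = sorted(samples)[:k]                  # the k smallest rewards
    return sum(worst) / k


# Hypothetical stand-in for m draws from a simulator: Gaussian returns.
rng = random.Random(0)
m = 100  # sample count quoted in the experiment setup
returns = [rng.gauss(1.0, 0.5) for _ in range(m)]

cvar_10 = empirical_cvar(returns, alpha=0.1)
mean_return = sum(returns) / m
```

Under this convention the estimate never exceeds the sample mean, which is why a risk-averse agent that maximizes it prefers actions with a lighter lower tail.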