Safe Reinforcement Learning with Linear Function Approximation

Authors: Sanae Amani, Christos Thrampoulidis, Lin Yang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present numerical simulations to complement and confirm our theoretical findings. We evaluate the performance of SLUCB-QVI on synthetic environments and implement RSLUCB-QVI on the Frozen Lake environment from OpenAI Gym (Brockman et al., 2016). Figure 1 depicts the average per-episode reward of SLUCB-QVI, compares it to that of a baseline, and highlights the value of SLUCB-QVI in respecting the safety constraints at all time-steps. (A minimal sketch of this evaluation protocol appears after the table.)
Researcher Affiliation | Academia | (1) Department of Electrical and Computer Engineering, University of California, Los Angeles; (2) Department of Electrical and Computer Engineering, University of British Columbia, Vancouver.
Pseudocode | Yes | Algorithm 1 (SLUCB-QVI) and Algorithm 2 (RSLUCB-QVI).
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology described is publicly available.
Open Datasets | Yes | We evaluate the performance of RSLUCB-QVI in the Frozen Lake environment from OpenAI Gym (Brockman et al., 2016). (A sketch of loading this environment appears after the table.)
Dataset Splits | No | The paper describes episodic interaction with the environments (synthetic and Frozen Lake) and evaluates performance over multiple realizations or agents. It does not provide pre-defined train/validation/test splits (as percentages or sample counts); this is typical of online reinforcement learning, where data is generated through interaction.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions 'Open AI Gym' but does not specify its version number or any other software dependencies with version numbers.
Experiment Setup | Yes | The results shown in Figure 1 depict averages over 20 realizations, for which we have chosen δ = 0.01, σ = 0.01, λ = 1, d = 5, τ = 0.5, H = 3 and K = 10000. We set H = 1000, K = 10, d = |S| = 100, and µ(s) ∼ N(0, I_d) for all s ∈ S = {s_1, ..., s_100}. We then properly specified the feature map φ(s, a)... In order to interpret the requirement of avoiding dangers as a constraint of the form (11), we tuned γ and τ as follows: the cost of playing action a ∈ A at state s ∈ S is the probability of the agent moving to one of the danger states. Therefore, a safe policy ensures that the expected probability of moving to a danger state is small. To this end, we set γ = Σ_{s ∈ Danger states} µ(s) and τ = 0.1. Also, for each state s ∈ S, the agent is given a safe action that leads to one of the danger states only with small probability (τ = 0.1). We solve a set of linear equations to tune θ such that at each state s ∈ S, the direction leading to the state closest to the goal gives the agent a reward of 1, while the other three directions give a reward of 0.01. This model encourages the agent to move toward the goal. (A sketch of this γ, τ construction appears after the table.)
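
The evaluation protocol quoted in the Research Type and Experiment Setup rows averages the per-episode reward over several independent realizations. Below is a minimal sketch of that protocol, assuming a hypothetical run_episode function standing in for one episode of SLUCB-QVI (or a baseline); the function name and signature are not from the paper.

```python
# Minimal sketch of the evaluation protocol: run K episodes per realization and
# average the per-episode reward across realizations. `run_episode` is a
# hypothetical stand-in for one episode of SLUCB-QVI (or a baseline policy).
import numpy as np

def average_per_episode_reward(run_episode, n_realizations=20, K=10000):
    rewards = np.zeros((n_realizations, K))
    for r in range(n_realizations):
        for k in range(K):
            rewards[r, k] = run_episode(realization=r, episode=k)
    return rewards.mean(axis=0)  # one averaged reward value per episode
```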
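
For reference, a minimal sketch of loading the Frozen Lake environment from OpenAI Gym follows. The 10x10 random map matches d = |S| = 100 from the experiment setup; the environment id, gym version, and map layout are assumptions, since the paper does not state them.

```python
# Sketch of instantiating Frozen Lake in OpenAI Gym (classic gym < 0.26 API assumed).
# A random 10x10 map gives |S| = 100 states; holes ("H") act as the danger states.
import gym
from gym.envs.toy_text.frozen_lake import generate_random_map

desc = generate_random_map(size=10)           # random 10x10 layout with holes
env = gym.make("FrozenLake-v1", desc=desc)    # id may be "FrozenLake-v0" on older gym

state = env.reset()
for _ in range(10):
    state, reward, done, info = env.step(env.action_space.sample())
    if done:
        state = env.reset()
env.close()
```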
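
The γ and τ construction described in the Experiment Setup row can be made concrete with a short sketch, assuming µ(s) ∼ N(0, I_d) for each state and a hypothetical list of danger-state indices; none of the identifiers below come from the authors' code.

```python
# Illustrative construction of the linear safety constraint <gamma, phi(s, a)> <= tau:
# mu(s) ~ N(0, I_d) per state, gamma = sum of mu(s) over danger states, tau = 0.1.
import numpy as np

rng = np.random.default_rng(0)
d = 100                                   # d = |S| = 100 in the Frozen Lake setup
danger_states = [19, 29, 35, 41, 42]      # hypothetical hole indices on the 10x10 grid

mu = rng.normal(size=(d, d))              # row s holds mu(s) ~ N(0, I_d)
gamma = mu[danger_states].sum(axis=0)     # gamma = sum over danger states of mu(s)
tau = 0.1                                 # safety threshold

def is_safe(phi_sa: np.ndarray) -> bool:
    """Return True if the feature vector phi(s, a) satisfies <gamma, phi(s, a)> <= tau."""
    return float(gamma @ phi_sa) <= tau
```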