Safe Reinforcement Learning with Linear Function Approximation
Authors: Sanae Amani, Christos Thrampoulidis, Lin Yang
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present numerical simulations to complement and confirm our theoretical findings. We evaluate the performance of SLUCB-QVI on synthetic environments and implement RSLUCB-QVI on the Frozen Lake environment from OpenAI Gym (Brockman et al., 2016). Figure 1 depicts the average per-episode reward of SLUCB-QVI, compares it to that of a baseline, and highlights the value of SLUCB-QVI in respecting the safety constraints at all time-steps. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, University of California, Los Angeles; Department of Electrical and Computer Engineering, University of British Columbia, Vancouver. |
| Pseudocode | Yes | Algorithm 1 (SLUCB-QVI) and Algorithm 2 (RSLUCB-QVI) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | We evaluate the performance of RSLUCB-QVI in the Frozen Lake environment from OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes episodic interaction with the environments (synthetic and Frozen Lake) and averages performance over multiple realizations or agents. It does not provide pre-defined train/validation/test splits (percentages or sample counts), which is typical of online reinforcement learning, where data is generated through interaction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'OpenAI Gym' but does not specify its version number or any other software dependencies with version numbers. |
| Experiment Setup | Yes | The results shown in Figure 1 depict averages over 20 realizations, for which we have chosen δ = 0.01, σ = 0.01, λ = 1, d = 5, τ = 0.5, H = 3 and K = 10000. For the Frozen Lake experiment, we set H = 1000, K = 10, d = |S| = 100, and µ(s) ∼ N(0, I_d) for all s ∈ S = {s1, . . . , s100}. We then properly specified the feature map φ(s, a)... In order to interpret the requirement of avoiding dangers as a constraint of the form (11), we tuned γ and τ as follows: the cost of playing action a ∈ A at state s ∈ S is the probability of the agent moving to one of the danger states. Therefore, a safe policy ensures that the expected probability of moving to a danger state is small. To this end, we set γ = Σ_{s ∈ danger states} µ(s) and τ = 0.1. Also, for each state s ∈ S, the agent is given a safe action that leads to one of the danger states only with small probability (τ = 0.1). We solve a set of linear equations to tune θ such that, at each state s ∈ S, the direction leading to the state closest to the goal state gives the agent a reward of 1, while the other three directions give a reward of 0.01. This model encourages the agent to move toward the goal. (A sketch of this constraint construction follows the table.) |
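
The cost-constraint construction quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature map `phi`, the danger-state indices, and the random seed are hypothetical placeholders. Only the quantities quoted above (d = |S| = 100, µ(s) ∼ N(0, I_d), γ = Σ_{s ∈ danger states} µ(s), τ = 0.1) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility of the sketch

n_states = 100        # |S| = 100 (10x10 Frozen Lake-style grid)
d = n_states          # d = |S|
n_actions = 4         # four movement directions
tau = 0.1             # safety threshold from the quoted setup

# mu(s) ~ N(0, I_d) for every state s in S = {s1, ..., s100}
mu = rng.normal(size=(n_states, d))

# Hypothetical feature map phi(s, a). In the paper, phi is tuned so that
# <phi(s, a), mu(s')> behaves as the transition probability P(s' | s, a);
# here random values only illustrate the linear-algebra structure.
phi = rng.normal(size=(n_states, n_actions, d))

# Hypothetical danger-state indices (holes in the lake), assumed for illustration.
danger_states = [12, 29, 35, 41, 49, 52, 59, 61, 74, 92]

# gamma = sum over danger states of mu(s), so that <phi(s, a), gamma>
# is the total probability of stepping into a danger state.
gamma = mu[danger_states].sum(axis=0)

def expected_danger_cost(s: int, a: int) -> float:
    """Cost c(s, a) = <phi(s, a), gamma>: under the paper's linear model,
    the probability of moving from state s under action a to a danger state."""
    return float(phi[s, a] @ gamma)

# A policy is safe when this expected cost stays below tau at every step.
safe_actions_at_s0 = [a for a in range(n_actions)
                      if expected_danger_cost(0, a) <= tau]
print(safe_actions_at_s0)
```

In the paper's linear-MDP setting, ⟨φ(s, a), µ(s')⟩ plays the role of P(s' | s, a), so ⟨φ(s, a), γ⟩ with γ = Σ_{s' ∈ danger states} µ(s') is exactly the probability of moving into a danger state, and the safety constraint requires this cost to remain below τ at every time-step.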