Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

Authors: Shangtong Zhang, Bo Liu, Shimon Whiteson

AAAI 2021, pp. 10905-10913

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All curves in this section are averaged over 10 independent runs, with shaded regions indicating standard errors. All implementations are publicly available. We report the mean of those 20 episodic returns against the training steps in Figure 1. The curves are generated by setting λ = 1. More details are provided in the appendix.
Researcher Affiliation | Academia | 1 University of Oxford, 2 Auburn University
Pseudocode | Yes | Algorithm 1: Mean-Variance Policy Iteration (MVPI) and Algorithm 2: Off-line MVPI
Open Source Code | Yes | All implementations are publicly available: https://github.com/ShangtongZhang/DeepRL
Open Datasets | Yes | We benchmark MVPI-TD3 on eight Mujoco robot manipulation tasks from OpenAI Gym.
Dataset Splits | No | No specific train/validation/test dataset splits (percentages or sample counts) are mentioned in the paper for the Mujoco tasks, which are continuous environments. Evaluation is described as 'evaluate the algorithm every 10^4 steps for 20 episodes'.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) are explicitly mentioned in the paper.
Experiment Setup | Yes | We run each algorithm for 10^6 steps and evaluate the algorithm every 10^4 steps for 20 episodes. We use two-hidden-layer neural networks for function approximation. In the policy evaluation step of MVPI-TD3, we set y_{k+1} to the average of the recent K rewards, where K is a hyperparameter to be tuned... The curves are generated by setting λ = 1. More details are provided in the appendix.
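
The experiment-setup row above states that MVPI-TD3's policy evaluation step sets y_{k+1} to the average of the recent K rewards, and Algorithm 1 optimizes the mean-variance objective by running an ordinary policy-improvement step on a modified per-step reward. The sketch below shows one plausible way to wire that bookkeeping into a TD3-style loop, assuming the augmented reward r_hat = r - λ·r² + 2·λ·y·r from the MVPI derivation with λ = 1 as in the reported curves; the class name, the demo loop, and the exact point at which the augmented reward is handed to the learner are illustrative assumptions, not the authors' implementation (which lives in the DeepRL repository linked above).

```python
import numpy as np
from collections import deque


class MVPIRewardAugmenter:
    """Hedged sketch of the MVPI reward bookkeeping for an off-policy learner.

    Assumptions (not spelled out verbatim in this report):
      * the augmented per-step reward is r_hat = r - lam * r**2 + 2 * lam * y * r;
      * y is tracked as the average of the most recent K environment rewards,
        where K is the tunable hyperparameter mentioned in the experiment setup.
    """

    def __init__(self, lam: float = 1.0, K: int = 10_000):
        self.lam = lam                 # risk-aversion coefficient (curves use lambda = 1)
        self.recent = deque(maxlen=K)  # sliding window of the last K raw rewards

    def update_y(self, reward: float) -> None:
        """Record a new environment reward for the running mean y_{k+1}."""
        self.recent.append(reward)

    @property
    def y(self) -> float:
        """Average of the recent K rewards (0.0 before any reward is seen)."""
        return float(np.mean(self.recent)) if self.recent else 0.0

    def augment(self, reward: float) -> float:
        """Map a raw reward to the MVPI surrogate reward given the current y."""
        return reward - self.lam * reward ** 2 + 2.0 * self.lam * self.y * reward


if __name__ == "__main__":
    # Self-contained demo with synthetic rewards; in MVPI-TD3 the augmented
    # reward would instead be passed to the TD3 critic/actor update.
    rng = np.random.default_rng(0)
    augmenter = MVPIRewardAugmenter(lam=1.0, K=100)
    for _ in range(1_000):
        r = float(rng.normal(loc=1.0, scale=0.5))
        augmenter.update_y(r)
        r_hat = augmenter.augment(r)
    print(f"running mean y = {augmenter.y:.3f}, last augmented reward = {r_hat:.3f}")
```

A design note, under the same assumptions: because the surrogate reward depends on y, which drifts as training proceeds, an implementation must decide whether to augment rewards when transitions are stored or when they are sampled from the replay buffer; the report does not settle this detail, so the sketch leaves that choice to the surrounding training loop.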