Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

Authors: Shangtong Zhang, Bo Liu, Shimon Whiteson

AAAI 2021, pp. 10905-10913

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All curves in this section are averaged over 10 independent runs, with shaded regions indicating standard errors. All implementations are publicly available. We report the mean of those 20 episodic returns against the training steps in Figure 1. The curves are generated by setting λ = 1. More details are provided in the appendix.
Researcher Affiliation | Academia | 1 University of Oxford, 2 Auburn University
Pseudocode | Yes | Algorithm 1: Mean-Variance Policy Iteration (MVPI) and Algorithm 2: Off-line MVPI
Open Source Code | Yes | All implementations are publicly available: https://github.com/ShangtongZhang/DeepRL
Open Datasets | Yes | We benchmark MVPI-TD3 on eight Mujoco robot manipulation tasks from OpenAI Gym.
Dataset Splits | No | No specific train/validation/test dataset splits (percentages or sample counts) are mentioned in the paper for the Mujoco tasks, which are continuous environments. Evaluation is described as 'evaluate the algorithm every 10^4 steps for 20 episodes'.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) are explicitly mentioned in the paper.
Experiment Setup | Yes | We run each algorithm for 10^6 steps and evaluate the algorithm every 10^4 steps for 20 episodes. We use two-hidden-layer neural networks for function approximation. In the policy evaluation step of MVPI-TD3, we set y_{k+1} to the average of the recent K rewards, where K is a hyperparameter to be tuned... The curves are generated by setting λ = 1. More details are provided in the appendix.
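
The experiment-setup row above states that MVPI-TD3's policy evaluation step sets y_{k+1} to the average of the recent K rewards, and Algorithm 1 optimizes the mean-variance objective by running an ordinary policy-improvement step on a modified per-step reward. The sketch below shows one plausible way to wire that bookkeeping into a TD3-style loop, assuming the augmented reward r_hat = r - λ·r² + 2·λ·y·r from the MVPI derivation with λ = 1 as in the reported curves; the class name, the demo loop, and the exact point at which the augmented reward is handed to the learner are illustrative assumptions, not the authors' implementation (which lives in the DeepRL repository linked above).

```python
import numpy as np
from collections import deque


class MVPIRewardAugmenter:
    """Hedged sketch of the MVPI reward bookkeeping for an off-policy learner.

    Assumptions (not spelled out verbatim in this report):
      * the augmented per-step reward is r_hat = r - lam * r**2 + 2 * lam * y * r;
      * y is tracked as the average of the most recent K environment rewards,
        where K is the tunable hyperparameter mentioned in the experiment setup.
    """

    def __init__(self, lam: float = 1.0, K: int = 10_000):
        self.lam = lam                 # risk-aversion coefficient (curves use lambda = 1)
        self.recent = deque(maxlen=K)  # sliding window of the last K raw rewards

    def update_y(self, reward: float) -> None:
        """Record a new environment reward for the running mean y_{k+1}."""
        self.recent.append(reward)

    @property
    def y(self) -> float:
        """Average of the recent K rewards (0.0 before any reward is seen)."""
        return float(np.mean(self.recent)) if self.recent else 0.0

    def augment(self, reward: float) -> float:
        """Map a raw reward to the MVPI surrogate reward given the current y."""
        return reward - self.lam * reward ** 2 + 2.0 * self.lam * self.y * reward


if __name__ == "__main__":
    # Self-contained demo with synthetic rewards; in MVPI-TD3 the augmented
    # reward would instead be passed to the TD3 critic/actor update.
    rng = np.random.default_rng(0)
    augmenter = MVPIRewardAugmenter(lam=1.0, K=100)
    for _ in range(1_000):
        r = float(rng.normal(loc=1.0, scale=0.5))
        augmenter.update_y(r)
        r_hat = augmenter.augment(r)
    print(f"running mean y = {augmenter.y:.3f}, last augmented reward = {r_hat:.3f}")
```

A design note, under the same assumptions: because the surrogate reward depends on y, which drifts as training proceeds, an implementation must decide whether to augment rewards when transitions are stored or when they are sampled from the replay buffer; the report does not settle this detail, so the sketch leaves that choice to the surrounding training loop.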