Variance Penalized On-Policy and Off-Policy Actor-Critic
Authors: Arushi Jain, Gandharv Patil, Ayush Jain, Khimya Khetarpal, Doina Precup
AAAI 2021, pp. 7899-7907
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return. We present an empirical analysis in both discrete and continuous environments for the proposed on-policy and off-policy VPAC algorithms. We compare our algorithms with two baselines: AC and VAAC, an existing variance-penalized actor-critic algorithm using an indirect variance estimator (Tamar and Mannor 2013) (see the Related Work section for further details). |
| Researcher Affiliation | Collaboration | Arushi Jain 1,2, Gandharv Patil 1,2, Ayush Jain 1,2, Khimya Khetarpal 1,2, Doina Precup 1,2,3; 1 McGill University, Montreal; 2 Mila, Montreal; 3 Google DeepMind, Montreal |
| Pseudocode | Yes | Algorithm 1: On-policy VPAC (an illustrative sketch of such an update is given after this table) |
| Open Source Code | Yes | Code for all the experiments is available on https://github.com/arushi12130/VariancePenalizedActorCritic.git |
| Open Datasets | Yes | We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Implementation details along with the hyperparameters used for all the experiments are provided in Appendix E. We modify the classic four rooms (FR) environment (Sutton, Precup, and Singh 1999) to include a patch of frozen states (see Fig. 1) with stochastic reward. We now turn to continuous state-action tasks in the MuJoCo OpenAI Gym (Brockman et al. 2016). |
| Dataset Splits | No | The paper does not specify explicit training, validation, or test dataset splits. It describes experiments in simulated environments and evaluates performance based on rolled-out trajectories and converged policies. |
| Hardware Specification | No | The paper mentions using the MuJoCo OpenAI Gym for continuous state-action tasks but does not specify any particular hardware (CPU, GPU models, memory, etc.) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like PPO, AC, and VAAC but does not specify their version numbers or other software dependencies with versions required for reproducibility. |
| Experiment Setup | Yes | Implementation details along with the hyperparameters used for all the experiments are provided in Appendix E. We use Boltzmann exploration and do a grid search to find the best hyperparameters for all algorithms, where the least variance is used to break ties among policies with maximum mean performance. We also show the table of best hyperparameters (found using grid search) used for all algorithms in Appendix E. (An illustrative sketch of this selection rule also follows the table.) |
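
The Pseudocode row cites "Algorithm 1: On-policy VPAC" without reproducing it. The sketch below is a minimal, tabular illustration of a variance-penalized actor-critic update of the general form the paper describes (maximize expected return minus a coefficient times the variance of the return). The two-critic construction, the TD-style target for the variance, and all names (`V`, `Sigma`, `psi`, `vpac_step`) are simplifying assumptions of this sketch, not the authors' exact Algorithm 1.

```python
import numpy as np

# Illustrative tabular variance-penalized actor-critic step.
# NOTE: a simplified sketch, not the paper's Algorithm 1.
# `psi` trades off expected return against variance of the return.

n_states, n_actions = 10, 4
gamma, psi = 0.99, 0.1                    # discount, variance-penalty coefficient
alpha_v, alpha_sigma, alpha_pi = 0.1, 0.1, 0.01

V = np.zeros(n_states)                    # value critic: estimate of E[G | s]
Sigma = np.zeros(n_states)                # variance critic: estimate of Var[G | s]
theta = np.zeros((n_states, n_actions))   # policy logits (Boltzmann / softmax)

def policy(s):
    """Softmax (Boltzmann) policy over the logits for state s."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def vpac_step(s, a, r, s_next, done):
    """One on-policy update of both critics and the variance-penalized actor."""
    target_v = r + (0.0 if done else gamma * V[s_next])
    delta_v = target_v - V[s]             # ordinary TD error for the value critic

    # TD-style target for the variance of the return: squared TD error plus
    # discounted downstream variance (a common direct variance estimator).
    target_sigma = delta_v ** 2 + (0.0 if done else (gamma ** 2) * Sigma[s_next])
    delta_sigma = target_sigma - Sigma[s]

    V[s] += alpha_v * delta_v
    Sigma[s] += alpha_sigma * delta_sigma

    # Penalized "advantage": reward-seeking term minus variance-seeking term.
    adv = delta_v - psi * delta_sigma

    # Gradient of log pi(a|s) with respect to the logits theta[s].
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * adv * grad_log_pi
```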
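
The Experiment Setup row reports that hyperparameters were chosen by grid search, with the least-variance policy breaking ties among configurations of maximum mean performance. The snippet below is a hypothetical illustration of that selection rule only; the dictionary layout, the `np.isclose` tolerance, and the example returns are invented for the demonstration and are not taken from the paper.

```python
import numpy as np

def select_best(results):
    """results: dict mapping hyperparameter setting -> array of episode returns.

    Pick the setting with the highest mean return; among (near-)ties,
    prefer the one whose returns have the lowest variance.
    """
    means = {cfg: np.mean(rets) for cfg, rets in results.items()}
    best_mean = max(means.values())
    tied = [cfg for cfg, m in means.items() if np.isclose(m, best_mean)]
    return min(tied, key=lambda cfg: np.var(results[cfg]))

# Example with made-up returns for three settings of a penalty coefficient:
results = {
    "psi=0.0": np.array([10.0, 2.0, 18.0, 10.0]),   # mean 10, high variance
    "psi=0.1": np.array([10.0, 9.5, 10.5, 10.0]),   # mean 10, low variance
    "psi=0.5": np.array([8.0, 8.1, 7.9, 8.0]),      # lower mean
}
print(select_best(results))  # -> "psi=0.1" (ties psi=0.0 on mean, lower variance)
```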