Variance Penalized On-Policy and Off-Policy Actor-Critic
Authors: Arushi Jain, Gandharv Patil, Ayush Jain, Khimya Khetarpal, Doina Precup
AAAI 2021, pp. 7899-7907
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return. We present an empirical analysis in both discrete and continuous environments for the proposed on-policy and off-policy VPAC algorithms. We compare our algorithms with two baselines: AC and VAAC, an existing variance-penalized actor-critic algorithm using an indirect variance estimator (Tamar and Mannor 2013) (see the Related Work section for further details). |
| Researcher Affiliation | Collaboration | Arushi Jain 1,2, Gandharv Patil 1,2, Ayush Jain 1,2, Khimya Khetarpal 1,2, Doina Precup 1,2,3; 1 McGill University, Montreal; 2 Mila, Montreal; 3 Google DeepMind, Montreal |
| Pseudocode | Yes | Algorithm 1: On-policy VPAC (an illustrative sketch of such an update is given after this table) |
| Open Source Code | Yes | Code for all the experiments is available on https://github.com/arushi12130/VariancePenalizedActorCritic.git |
| Open Datasets | Yes | We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Implementation details along with the hyperparameters used for all the experiments are provided in Appendix E. We modify the classic four rooms (FR) environment (Sutton, Precup, and Singh 1999) to include a patch of frozen states (see Fig. 1) with stochastic reward. We now turn to continuous state-action tasks in the MuJoCo OpenAI Gym (Brockman et al. 2016). |
| Dataset Splits | No | The paper does not specify explicit training, validation, or test dataset splits. It describes experiments in simulated environments and evaluates performance based on rolled-out trajectories and converged policies. |
| Hardware Specification | No | The paper mentions using the MuJoCo OpenAI Gym for continuous state-action tasks but does not specify any particular hardware (CPU, GPU models, memory, etc.) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like PPO, AC, and VAAC but does not specify their version numbers or other software dependencies with versions required for reproducibility. |
| Experiment Setup | Yes | Implementation details along with the hyperparameters used for all the experiments are provided in Appendix E. We use Boltzmann exploration and do a grid search to find the best hyperparameters for all algorithms, where the least variance is used to break ties among policies with maximum mean performance. We also show the table of best hyperparameters (found using grid search) used for all algorithms in Appendix E. (An illustrative sketch of this selection rule also follows the table.) |
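
The Pseudocode row cites "Algorithm 1: On-policy VPAC" without reproducing it. The sketch below is a minimal, tabular illustration of a variance-penalized actor-critic update of the general form the paper describes (maximize expected return minus a coefficient times the variance of the return). The two-critic construction, the TD-style target for the variance, and all names (`V`, `Sigma`, `psi`, `vpac_step`) are simplifying assumptions of this sketch, not the authors' exact Algorithm 1.

```python
import numpy as np

# Illustrative tabular variance-penalized actor-critic step.
# NOTE: a simplified sketch, not the paper's Algorithm 1.
# `psi` trades off expected return against variance of the return.

n_states, n_actions = 10, 4
gamma, psi = 0.99, 0.1                    # discount, variance-penalty coefficient
alpha_v, alpha_sigma, alpha_pi = 0.1, 0.1, 0.01

V = np.zeros(n_states)                    # value critic: estimate of E[G | s]
Sigma = np.zeros(n_states)                # variance critic: estimate of Var[G | s]
theta = np.zeros((n_states, n_actions))   # policy logits (Boltzmann / softmax)

def policy(s):
    """Softmax (Boltzmann) policy over the logits for state s."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def vpac_step(s, a, r, s_next, done):
    """One on-policy update of both critics and the variance-penalized actor."""
    target_v = r + (0.0 if done else gamma * V[s_next])
    delta_v = target_v - V[s]             # ordinary TD error for the value critic

    # TD-style target for the variance of the return: squared TD error plus
    # discounted downstream variance (a common direct variance estimator).
    target_sigma = delta_v ** 2 + (0.0 if done else (gamma ** 2) * Sigma[s_next])
    delta_sigma = target_sigma - Sigma[s]

    V[s] += alpha_v * delta_v
    Sigma[s] += alpha_sigma * delta_sigma

    # Penalized "advantage": reward-seeking term minus variance-seeking term.
    adv = delta_v - psi * delta_sigma

    # Gradient of log pi(a|s) with respect to the logits theta[s].
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * adv * grad_log_pi
```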
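
The Experiment Setup row reports that hyperparameters were chosen by grid search, with the least-variance policy breaking ties among configurations of maximum mean performance. The snippet below is a hypothetical illustration of that selection rule only; the dictionary layout, the `np.isclose` tolerance, and the example returns are invented for the demonstration and are not taken from the paper.

```python
import numpy as np

def select_best(results):
    """results: dict mapping hyperparameter setting -> array of episode returns.

    Pick the setting with the highest mean return; among (near-)ties,
    prefer the one whose returns have the lowest variance.
    """
    means = {cfg: np.mean(rets) for cfg, rets in results.items()}
    best_mean = max(means.values())
    tied = [cfg for cfg, m in means.items() if np.isclose(m, best_mean)]
    return min(tied, key=lambda cfg: np.var(results[cfg]))

# Example with made-up returns for three settings of a penalty coefficient:
results = {
    "psi=0.0": np.array([10.0, 2.0, 18.0, 10.0]),   # mean 10, high variance
    "psi=0.1": np.array([10.0, 9.5, 10.5, 10.0]),   # mean 10, low variance
    "psi=0.5": np.array([8.0, 8.1, 7.9, 8.0]),      # lower mean
}
print(select_best(results))  # -> "psi=0.1" (ties psi=0.0 on mean, lower variance)
```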