An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Authors: Ehsan Imani, Eric Graves, Martha White
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We empirically investigate the utility of using the true off-policy gradient, as opposed to the previous approximation used by OffPAC; the impact of the choice of λ_a; and the efficacy of estimating emphatic weightings in ACE. We present a toy problem to highlight the fact that OffPAC, which uses an approximate semi-gradient, can converge to suboptimal solutions, even in ideal conditions, whereas ACE with the true gradient converges to the optimal solution. |
| Researcher Affiliation | Academia | Ehsan Imani, Eric Graves, Martha White. Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta. {imani,graves,whitem}@ualberta.ca |
| Pseudocode | Yes | We provide the complete Actor-Critic with Emphatic weightings (ACE) algorithm, with pseudocode and additional algorithm details, in Appendix B. (A sketch of the corresponding actor update appears after this table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or a link to a code repository. |
| Open Datasets | No | The paper describes custom-designed 'toy problems' and 'environments' for its experiments (e.g., 'a world with aliased states', 'a three-state MDP', 'continuous action MDP') rather than using or providing access information for publicly available datasets. |
| Dataset Splits | No | The paper uses custom-designed 'toy problems' and 'environments' where data is generated through interaction, rather than pre-existing datasets with explicit train/validation/test splits. Therefore, no dataset split information is provided. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | The actor has a softmax output on a linear transformation of features and is trained with a step-size of 0.1 (though results were similar across all the step-sizes tested). The actor's step-size is picked from {5×10⁻⁵, 10⁻⁴, 2×10⁻⁴, 5×10⁻⁴, 10⁻³, 2×10⁻³, 5×10⁻³, 10⁻²}. The step-size α_v was chosen from {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰}, α_w was chosen from {10⁻¹⁰, 10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻²}, and {0, 0.5, 1.0} was the set of candidate values of λ for the critic. All actors are initialized with zero weights. (These grids are transcribed in the sweep sketch after this table.) |
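
As a companion to the pseudocode row above, here is a minimal sketch of the ACE actor update with emphatic weightings. It assumes a linear-softmax actor and the ETD-style follow-on trace the paper builds on; the class name `ACEActor` and all variable names are illustrative, and the exact algorithm (including the critic that supplies the TD error) is given in Appendix B of the paper.

```python
# Minimal sketch of the ACE actor update, assuming the emphatic-weighting
# form described in the paper; names are illustrative, not the paper's code.
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class ACEActor:
    def __init__(self, n_features, n_actions, step_size=0.1, lambda_a=1.0):
        # Zero-initialized weights, as stated in the experiment setup.
        self.theta = np.zeros((n_actions, n_features))
        self.alpha = step_size
        self.lambda_a = lambda_a  # trade-off between semi-gradient and full gradient
        self.F = 0.0              # follow-on trace

    def pi(self, x):
        # Softmax over a linear transformation of the feature vector x.
        return softmax(self.theta @ x)

    def update(self, x, a, delta, rho, rho_prev, gamma, interest=1.0):
        # Follow-on trace and emphatic weighting M_t (ETD-style form;
        # see Appendix B of the paper for the exact algorithm).
        self.F = gamma * rho_prev * self.F + interest
        M = (1.0 - self.lambda_a) * interest + self.lambda_a * self.F
        # Gradient of log pi(a|x) for the linear-softmax parameterization.
        probs = self.pi(x)
        grad_log_pi = np.outer(-probs, x)
        grad_log_pi[a] += x
        # Off-policy actor update: importance ratio rho, emphasis M, TD error delta.
        self.theta += self.alpha * rho * M * delta * grad_log_pi
```

Setting `lambda_a=0` drops the emphatic weighting and recovers an OffPAC-style semi-gradient update, while `lambda_a=1` weights updates by the full follow-on trace, which is the regime the paper shows converges to the optimal solution on its toy problem.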
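The experiment-setup row above also pins down a hyperparameter sweep. The sketch below transcribes those grids; the loop structure, feature and action counts, and the elided training loop are assumptions for illustration, not details from the paper.

```python
# Sketch of the hyperparameter sweep implied by the reported setup.
# Grid values are transcribed from the paper's excerpt; everything else
# (loop structure, n_features, n_actions) is a hypothetical placeholder.
from itertools import product

actor_alphas    = [5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2]
critic_alphas_v = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0]
critic_alphas_w = [1e-10, 1e-8, 1e-6, 1e-4, 1e-2]
critic_lambdas  = [0.0, 0.5, 1.0]

for alpha_t, alpha_v, alpha_w, lam in product(
        actor_alphas, critic_alphas_v, critic_alphas_w, critic_lambdas):
    actor = ACEActor(n_features=8, n_actions=2, step_size=alpha_t)
    # ... run the environment, train the critic with (alpha_v, alpha_w, lam),
    # and apply the actor update from the sketch above ...
```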