An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Authors: Ehsan Imani, Eric Graves, Martha White
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We empirically investigate the utility of using the true off-policy gradient, as opposed to the previous approximation used by OffPAC; the impact of the choice of λ_a; and the efficacy of estimating emphatic weightings in ACE. We present a toy problem to highlight the fact that OffPAC, which uses an approximate semi-gradient, can converge to suboptimal solutions, even in ideal conditions, whereas ACE with the true gradient converges to the optimal solution. |
| Researcher Affiliation | Academia | Ehsan Imani, Eric Graves, Martha White. Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta. {imani,graves,whitem}@ualberta.ca |
| Pseudocode | Yes | We provide the complete Actor-Critic with Emphatic weightings (ACE) algorithm, with pseudocode and additional algorithm details, in Appendix B. (A sketch of the corresponding actor update appears after this table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or a link to a code repository. |
| Open Datasets | No | The paper describes custom-designed 'toy problems' and 'environments' for its experiments (e.g., 'a world with aliased states', 'a three-state MDP', 'continuous action MDP') rather than using or providing access information for publicly available datasets. |
| Dataset Splits | No | The paper uses custom-designed 'toy problems' and 'environments' where data is generated through interaction, rather than pre-existing datasets with explicit train/validation/test splits. Therefore, no dataset split information is provided. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | The actor has a softmax output on a linear transformation of features and is trained with a step-size of 0.1 (though results were similar across all the step-sizes tested). The actor's step-size is picked from {5×10⁻⁵, 10⁻⁴, 2×10⁻⁴, 5×10⁻⁴, 10⁻³, 2×10⁻³, 5×10⁻³, 10⁻²}. The step-size α_v was chosen from {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰}, α_w was chosen from {10⁻¹⁰, 10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻²}, and {0, 0.5, 1.0} was the set of candidate values of λ for the critic. All actors are initialized with zero weights. (These grids are transcribed in the sweep sketch after this table.) |
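
As a companion to the pseudocode row above, here is a minimal sketch of the ACE actor update with emphatic weightings. It assumes a linear-softmax actor and the ETD-style follow-on trace the paper builds on; the class name `ACEActor` and all variable names are illustrative, and the exact algorithm (including the critic that supplies the TD error) is given in Appendix B of the paper.

```python
# Minimal sketch of the ACE actor update, assuming the emphatic-weighting
# form described in the paper; names are illustrative, not the paper's code.
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class ACEActor:
    def __init__(self, n_features, n_actions, step_size=0.1, lambda_a=1.0):
        # Zero-initialized weights, as stated in the experiment setup.
        self.theta = np.zeros((n_actions, n_features))
        self.alpha = step_size
        self.lambda_a = lambda_a  # trade-off between semi-gradient and full gradient
        self.F = 0.0              # follow-on trace

    def pi(self, x):
        # Softmax over a linear transformation of the feature vector x.
        return softmax(self.theta @ x)

    def update(self, x, a, delta, rho, rho_prev, gamma, interest=1.0):
        # Follow-on trace and emphatic weighting M_t (ETD-style form;
        # see Appendix B of the paper for the exact algorithm).
        self.F = gamma * rho_prev * self.F + interest
        M = (1.0 - self.lambda_a) * interest + self.lambda_a * self.F
        # Gradient of log pi(a|x) for the linear-softmax parameterization.
        probs = self.pi(x)
        grad_log_pi = np.outer(-probs, x)
        grad_log_pi[a] += x
        # Off-policy actor update: importance ratio rho, emphasis M, TD error delta.
        self.theta += self.alpha * rho * M * delta * grad_log_pi
```

Setting `lambda_a=0` drops the emphatic weighting and recovers an OffPAC-style semi-gradient update, while `lambda_a=1` weights updates by the full follow-on trace, which is the regime the paper shows converges to the optimal solution on its toy problem.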
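The experiment-setup row above also pins down a hyperparameter sweep. The sketch below transcribes those grids; the loop structure, feature and action counts, and the elided training loop are assumptions for illustration, not details from the paper.

```python
# Sketch of the hyperparameter sweep implied by the reported setup.
# Grid values are transcribed from the paper's excerpt; everything else
# (loop structure, n_features, n_actions) is a hypothetical placeholder.
from itertools import product

actor_alphas    = [5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2]
critic_alphas_v = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0]
critic_alphas_w = [1e-10, 1e-8, 1e-6, 1e-4, 1e-2]
critic_lambdas  = [0.0, 0.5, 1.0]

for alpha_t, alpha_v, alpha_w, lam in product(
        actor_alphas, critic_alphas_v, critic_alphas_w, critic_lambdas):
    actor = ACEActor(n_features=8, n_actions=2, step_size=alpha_t)
    # ... run the environment, train the critic with (alpha_v, alpha_w, lam),
    # and apply the actor update from the sketch above ...
```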