A Temporal-Difference Approach to Policy Gradient Estimation

Authors: Samuele Tosatto, Andrew Patterson, Martha White, Rupam Mahmood

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
Researcher Affiliation | Academia | 1 Department of Computer Science, University of Alberta, Edmonton, Canada; 2 CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii). Correspondence to: Samuele Tosatto <tosatto@ualberta.ca>.
Pseudocode | Yes | Algorithm 1 details the pseudocode of TDRC with policy improvement. ... Algorithm 1 TDRCΓ ... Algorithm 2 Policy Gradient with LSTDΓ. (A TDRC-style update sketch follows this table.)
Open Source Code | Yes | The implementation of the experiments can be found at https://github.com/SamuelePolimi/temporaldifference-gradient.
Open Datasets | Yes | Imani's MDP (Figure 1a) is designed to show the fallacy of semi-gradient methods under an off-policy distribution. In their work, Imani et al. assumed a perfect critic but aliased states for the actor.
Dataset Splits | No | The paper mentions generating datasets and performing runs for the bias/variance analysis and performance evaluation, but it does not specify explicit train/validation/test splits; only dataset sizes and analysis parameters are given.
Hardware Specification | No | The paper mentions "a modern desktop processor" but does not specify particular CPU or GPU models, memory, or other detailed hardware specifications for running the experiments.
Software Dependencies | No | The paper mentions the "Adam optimizer" and the use of "automatic differentiation via pytorch" but does not provide version numbers for any software libraries, frameworks, or dependencies.
Experiment Setup | Yes | We swept the hyperparameters for both the Gradient Actor-Critic and the Actor-Critic baseline, selecting the maximizing hyperparameter setting using 30 random seeds. The swept hyperparameters are reported in the table below. ... Optimizer ADAM(β1 = 0.9, β2 = 0.999) ... Learning rate for the critic {0.1, 0.01, 0.001, 0.0001} ... Learning rate for the actor {0.1, 0.01, 0.001, 0.0001} ... Eligibility trace {0.9, 0.75, 0.5, 0.1} ... Adam's learning rate is set to 0.001, and the optimization takes 5000 steps. We used β = 1 as the regularization factor and a constant learning rate α = 0.1 for both the critic and the gradient critic. (A sketch of this sweep grid follows the table.)
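
The Pseudocode row above references Algorithm 1 (TDRCΓ), which builds on TD learning with Regularized Corrections. Below is a minimal sketch of a plain TDRC update for a linear critic, not the paper's gradient-critic extension; the function name, the importance-sampling ratio rho, and the default constants are illustrative assumptions, not the authors' code.

import numpy as np

def tdrc_update(w, h, x, x_next, reward, rho, gamma=0.99, alpha=0.1, beta=1.0):
    # TD error of the linear value estimate v(s) = w @ x(s)
    delta = reward + gamma * (w @ x_next) - (w @ x)
    # Primary weights: TDC-style update with a gradient-correction term
    w = w + alpha * rho * (delta * x - gamma * (h @ x) * x_next)
    # Secondary weights: expected-TD-error estimate with l2 regularization (beta)
    h = h + alpha * rho * (delta - h @ x) * x - alpha * beta * h
    return w, h

# Illustrative usage with random features
rng = np.random.default_rng(0)
w, h = np.zeros(4), np.zeros(4)
w, h = tdrc_update(w, h, rng.normal(size=4), rng.normal(size=4), reward=1.0, rho=1.0)

The quoted settings β = 1 and α = 0.1 from the Experiment Setup row correspond to the beta and alpha arguments here.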
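
The Experiment Setup row describes a grid sweep over critic and actor learning rates and eligibility traces, evaluated with 30 random seeds. The enumeration below is a sketch of that grid, assuming a standard itertools.product sweep; only the listed values and the Adam settings come from the quoted text.

import itertools

critic_lrs = [0.1, 0.01, 0.001, 0.0001]
actor_lrs = [0.1, 0.01, 0.001, 0.0001]
eligibility_traces = [0.9, 0.75, 0.5, 0.1]
n_seeds = 30

# Every combination of the swept hyperparameters
grid = list(itertools.product(critic_lrs, actor_lrs, eligibility_traces))
print(f"{len(grid)} settings x {n_seeds} seeds = {len(grid) * n_seeds} runs")

# Adam settings quoted in the Experiment Setup row; the 0.001 learning rate
# and 5000 steps refer to the separate optimization described there.
adam_kwargs = {"lr": 0.001, "betas": (0.9, 0.999)}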