A Temporal-Difference Approach to Policy Gradient Estimation

Authors: Samuele Tosatto, Andrew Patterson, Martha White, Rupam Mahmood

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
Researcher Affiliation | Academia | 1 Department of Computer Science, University of Alberta, Edmonton, Canada; 2 CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii). Correspondence to: Samuele Tosatto <tosatto@ualberta.ca>.
Pseudocode | Yes | Algorithm 1 details the pseudocode of TDRC with policy improvement. ... Algorithm 1 TDRCΓ ... Algorithm 2 Policy Gradient with LSTDΓ. (A TDRC-style update sketch follows this table.)
Open Source Code | Yes | The implementation of the experiments can be found at https://github.com/SamuelePolimi/temporaldifference-gradient.
Open Datasets | Yes | Imani's MDP (Figure 1a) is designed to show the fallacy of semi-gradient methods under an off-policy distribution. In their work, Imani et al. assumed a perfect critic but aliased states for the actor.
Dataset Splits | No | The paper mentions generating datasets and performing runs for the bias/variance analysis and performance evaluation, but it does not specify explicit train/validation/test splits; only dataset sizes and analysis parameters are given.
Hardware Specification | No | The paper mentions "a modern desktop processor" but does not specify particular CPU or GPU models, memory, or other detailed hardware specifications for running the experiments.
Software Dependencies | No | The paper mentions the "Adam optimizer" and the use of "automatic differentiation via pytorch" but does not provide version numbers for any software libraries, frameworks, or dependencies.
Experiment Setup | Yes | We swept the hyperparameters for both the Gradient Actor-Critic and the Actor-Critic baseline, selecting the maximizing hyperparameter setting using 30 random seeds. The swept hyperparameters are reported in the table below. ... Optimizer ADAM(β1 = 0.9, β2 = 0.999) ... Learning rate for the critic {0.1, 0.01, 0.001, 0.0001} ... Learning rate for the actor {0.1, 0.01, 0.001, 0.0001} ... Eligibility trace {0.9, 0.75, 0.5, 0.1} ... Adam's learning rate is set to 0.001, and the optimization takes 5000 steps. We used β = 1 as the regularization factor and a constant learning rate α = 0.1 for both the critic and the gradient critic. (A sketch of this sweep grid follows the table.)
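
The Pseudocode row above references Algorithm 1 (TDRCΓ), which builds on TD learning with Regularized Corrections. Below is a minimal sketch of a plain TDRC update for a linear critic, not the paper's gradient-critic extension; the function name, the importance-sampling ratio rho, and the default constants are illustrative assumptions, not the authors' code.

import numpy as np

def tdrc_update(w, h, x, x_next, reward, rho, gamma=0.99, alpha=0.1, beta=1.0):
    # TD error of the linear value estimate v(s) = w @ x(s)
    delta = reward + gamma * (w @ x_next) - (w @ x)
    # Primary weights: TDC-style update with a gradient-correction term
    w = w + alpha * rho * (delta * x - gamma * (h @ x) * x_next)
    # Secondary weights: expected-TD-error estimate with l2 regularization (beta)
    h = h + alpha * rho * (delta - h @ x) * x - alpha * beta * h
    return w, h

# Illustrative usage with random features
rng = np.random.default_rng(0)
w, h = np.zeros(4), np.zeros(4)
w, h = tdrc_update(w, h, rng.normal(size=4), rng.normal(size=4), reward=1.0, rho=1.0)

The quoted settings β = 1 and α = 0.1 from the Experiment Setup row correspond to the beta and alpha arguments here.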
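
The Experiment Setup row describes a grid sweep over critic and actor learning rates and eligibility traces, evaluated with 30 random seeds. The enumeration below is a sketch of that grid, assuming a standard itertools.product sweep; only the listed values and the Adam settings come from the quoted text.

import itertools

critic_lrs = [0.1, 0.01, 0.001, 0.0001]
actor_lrs = [0.1, 0.01, 0.001, 0.0001]
eligibility_traces = [0.9, 0.75, 0.5, 0.1]
n_seeds = 30

# Every combination of the swept hyperparameters
grid = list(itertools.product(critic_lrs, actor_lrs, eligibility_traces))
print(f"{len(grid)} settings x {n_seeds} seeds = {len(grid) * n_seeds} runs")

# Adam settings quoted in the Experiment Setup row; the 0.001 learning rate
# and 5000 steps refer to the separate optimization described there.
adam_kwargs = {"lr": 0.001, "betas": (0.9, 0.999)}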