A Temporal-Difference Approach to Policy Gradient Estimation
Authors: Samuele Tosatto, Andrew Patterson, Martha White, Rupam Mahmood
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, University of Alberta, Edmonton, Canada; ²CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii). Correspondence to: Samuele Tosatto <tosatto@ualberta.ca>. |
| Pseudocode | Yes | Algorithm 1 details a pseudocode of TDRC with policy improvement. ... Algorithm 1 TDRCΓ ... Algorithm 2 Policy Gradient with LSTDΓ (an illustrative TDRC-style update sketch is given after the table) |
| Open Source Code | Yes | The implementation of the experiments can be found at https://github.com/SamuelePolimi/temporaldifference-gradient. |
| Open Datasets | Yes | Imani's MDP (Figure 1a) is designed to show the fallacy of semi-gradient methods under off-policy distribution. In their work, Imani et al. assumed a perfect critic but aliased states for the actor. |
| Dataset Splits | No | The paper mentions generating datasets and performing runs for bias/variance analysis and performance evaluation, but it does not specify explicit train/validation/test splits, only dataset sizes or parameters for analysis. |
| Hardware Specification | No | The paper mentions "a modern desktop processor" but does not specify any particular CPU or GPU models, memory, or other detailed hardware specifications for running the experiments. |
| Software Dependencies | No | The paper mentions "Adam optimizer" and the use of "automatic differentiation via pytorch" but does not provide specific version numbers for any software libraries, frameworks, or dependencies. |
| Experiment Setup | Yes | We swept the hyperparameters for both the Gradient Actor-Critic and Actor-Critic baseline, selecting the maximizing hyperparameter setting using 30 random seeds. The swept hyperparameters are reported in the table below. ... Optimizer ADAM(β1 = 0.9, β2 = 0.999) ... Learning rate for the critic {0.1, 0.01, 0.001, 0.0001} ... Learning rate for the actor {0.1, 0.01, 0.001, 0.0001} ... Eligibility Trace {0.9, 0.75, 0.5, 0.1} ... Adam's learning rate is set to 0.001, and the optimization takes 5000 steps. We used β = 1 as regularization factor and constant learning rate for both critic and gradient critic α = 0.1. (The quoted sweep is restated as a configuration grid in the sketch after the table.) |
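
To make the pseudocode row more concrete, here is a minimal sketch of the standard linear TDRC value-critic update that the paper's TDRCΓ builds on. It is not the paper's gradient-critic extension or its released code; the function and variable names are our own, and the constants α = 0.1 and β = 1 simply mirror the values quoted in the Experiment Setup row.

```python
import numpy as np

def tdrc_update(w, h, x, r, x_next, gamma=0.99, alpha=0.1, beta=1.0):
    """One TDRC update for a linear value critic (illustrative sketch).

    w       : value-function weights, so v(s) is estimated as w @ x
    h       : auxiliary correction weights
    x, x_next : feature vectors for the current and next state
    alpha, beta : step size and regularization factor
    """
    delta = r + gamma * (w @ x_next) - (w @ x)                # TD error
    w = w + alpha * (delta * x - gamma * (h @ x) * x_next)    # TD update with gradient correction
    h = h + alpha * ((delta - h @ x) * x - beta * h)          # regularized update of the helper weights
    return w, h
```

The same update pattern, applied to a vector-valued gradient critic instead of a scalar value critic, is the idea behind the TDRCΓ pseudocode referenced above.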
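
The Experiment Setup row can also be read as a small sweep grid. The sketch below restates the quoted values as a Python configuration enumerator; the dictionary keys and helper name are ours, not from the released code, and the swept and fixed values may belong to different experiments in the paper.

```python
from itertools import product

# Swept hyperparameters, as quoted in the table; selection uses 30 random seeds.
SWEEP = {
    "critic_lr": [0.1, 0.01, 0.001, 0.0001],
    "actor_lr": [0.1, 0.01, 0.001, 0.0001],
    "eligibility_trace": [0.9, 0.75, 0.5, 0.1],
}

# Fixed settings quoted alongside the sweep.
FIXED = {
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),
    "adam_lr": 0.001,          # "Adam's learning rate is set to 0.001"
    "optimization_steps": 5000,
    "beta_regularizer": 1.0,   # "β = 1 as regularization factor"
    "critic_alpha": 0.1,       # constant learning rate for critic and gradient critic
    "n_seeds": 30,
}

def configurations():
    """Enumerate every swept configuration, merged with the fixed settings."""
    keys = list(SWEEP)
    for values in product(*(SWEEP[k] for k in keys)):
        yield {**FIXED, **dict(zip(keys, values))}
```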