Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Temporal-Difference Approach to Policy Gradient Estimation
Authors: Samuele Tosatto, Andrew Patterson, Martha White, Rupam Mahmood
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, University of Alberta, Edmonton, Canada; ²CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii). Correspondence to: Samuele Tosatto <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 details a pseudocode of TDRC with policy improvement. ... Algorithm 1 TDRCΓ ... Algorithm 2 Policy Gradient with LSTDΓ |
| Open Source Code | Yes | The implementation of the experiments can be found at https://github.com/SamuelePolimi/temporaldifference-gradient. |
| Open Datasets | Yes | Imani's MDP (Figure 1a) is designed to show the fallacy of semi-gradient methods under off-policy distribution. In their work, Imani et al. assumed a perfect critic but aliased states for the actor. |
| Dataset Splits | No | The paper mentions generating datasets and performing runs for bias/variance analysis and performance evaluation, but it does not specify explicit train/validation/test splits, only dataset sizes or parameters for analysis. |
| Hardware Specification | No | The paper mentions "a modern desktop processor" but does not specify any particular CPU or GPU models, memory, or other detailed hardware specifications for running the experiments. |
| Software Dependencies | No | The paper mentions "Adam optimizer" and the use of "automatic differentiation via pytorch" but does not provide specific version numbers for any software libraries, frameworks, or dependencies. |
| Experiment Setup | Yes | We swept the hyperparameters for both the Gradient Actor-Critic and Actor-Critic baseline, selecting the maximizing hyperparameter setting using 30 random seeds. The swept hyperparameters are reported in the table below. ... Optimizer ADAM(β1 = 0.9, β2 = 0.999) ... Learning rate for the critic {0.1, 0.01, 0.001, 0.0001} ... Learning rate for the actor {0.1, 0.01, 0.001, 0.0001} ... Eligibility Trace {0.9, 0.75, 0.5, 0.1} ... Adam's learning rate is set to 0.001, and the optimization takes 5000 steps. We used β = 1 as regularization factor and constant learning rate for both critic and gradient critic α = 0.1. |
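
The sweep quoted in the Experiment Setup row can be illustrated with a short sketch. It only enumerates the three reported value sets and picks the setting with the highest score averaged over 30 seeds. The full Cartesian grid, the averaging rule, and the names `sweep_grid`, `select_best`, and `evaluate` are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from itertools import product

# Value sets quoted in the Experiment Setup row.
CRITIC_LRS = [0.1, 0.01, 0.001, 0.0001]
ACTOR_LRS = [0.1, 0.01, 0.001, 0.0001]
TRACES = [0.9, 0.75, 0.5, 0.1]
N_SEEDS = 30  # number of random seeds reported for selection


def sweep_grid():
    """Enumerate every combination of the three swept hyperparameters.

    Assumption: the sweep is a full Cartesian product of the value sets.
    """
    return [
        {"critic_lr": c, "actor_lr": a, "trace": t}
        for c, a, t in product(CRITIC_LRS, ACTOR_LRS, TRACES)
    ]


def select_best(evaluate):
    """Return the configuration with the highest seed-averaged score.

    `evaluate(config, seed) -> float` is a hypothetical callback standing in
    for one training run; averaging over seeds is an assumed selection rule.
    """
    def mean_score(config):
        return sum(evaluate(config, seed) for seed in range(N_SEEDS)) / N_SEEDS

    return max(sweep_grid(), key=mean_score)
```

Under these assumptions the grid holds 4 × 4 × 4 = 64 settings, so selection would take 64 × 30 = 1920 runs per algorithm.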