Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On Generalized Bellman Equations and Temporal-Difference Learning
Authors: Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton
JMLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to the theoretical study, we also present the results from a preliminary numerical study that compares several ways of setting λ for the least-squares based off-policy algorithm. The results demonstrate the advantages of the proposed new scheme with its greater flexibility. [...] 4. Numerical Study In this section, we first use a toy problem to illustrate the behavior of traces calculated by off-policy LSTD(λ) for constant λ and for λ that evolves according to a simple special case of our proposed scheme described in Example 2.1. We then compare the behavior of LSTD for various choices of λ, on the toy problem and on the Mountain Car problem. |
| Researcher Affiliation | Collaboration | Huizhen Yu EMAIL Reinforcement Learning and Artificial Intelligence Group Department of Computing Science, University of Alberta Edmonton, AB, T6G 2E8, Canada. A. Rupam Mahmood EMAIL Kindred Inc. 243 College St Toronto, ON M5T 1R5, Canada. Richard S. Sutton EMAIL Reinforcement Learning and Artificial Intelligence Group Department of Computing Science, University of Alberta Edmonton, AB, T6G 2E8, Canada |
| Pseudocode | No | The paper describes methods and equations in text and mathematical notation (e.g., (2.2), (2.3), (2.4)), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the described methodology, nor does it provide links to any code repositories. |
| Open Datasets | Yes | In this subsection we demonstrate LSTD with evolving λ on a problem adapted from the well-known Mountain Car problem (Sutton and Barto, 1998). ... The dynamics is as given in (Sutton and Barto, 1998). |
| Dataset Splits | No | The paper describes how data are generated from a behavior policy in a simulated environment and how value functions are evaluated on a grid of points, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) of a fixed dataset, as is common in supervised learning. Instead, it uses an online learning setup, running simulations for a specified number of iterations. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions the use of 'tile-coding (Sutton and Barto, 1998)' for feature generation but does not specify any software libraries or frameworks with their version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We ran LSTD for C = 10, 20, ..., 100, using the same trajectory, for 3 × 10^5 iterations, and we computed the (Euclidean) distance of these LSTD solutions to the asymptotic TD(1) solution (in the space of the θ-parameters), normalized by the norm of the latter. We then repeat this calculation 10 times, each time with an independently generated trajectory. ... The discount factor is γ = 0.9 for all states. ... We use tile-coding (Sutton and Barto, 1998) to generate 145 binary features for our experiments. ... We ran LSTD with different ways of setting λ just mentioned, on the same state trajectory generated by the behavior policy, for 6 × 10^5 effective iterations ... we consider (2.10)-(2.11) with parameters β = 0.9, K ∈ {1.5, 2.0, 2.5, 3.0} and C ∈ {50, 125}. |
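The quantities in the Experiment Setup row can be illustrated with a short sketch: a generic, textbook-style off-policy LSTD(λ) accumulation with a constant λ, plus the normalized Euclidean distance to a reference TD(1) solution. This is not the paper's evolving-λ scheme (2.10)-(2.11), whose details are not reproduced in this report; the function names and the regularization constant are illustrative assumptions.

```python
import numpy as np

def normalized_distance(theta_lstd, theta_td1):
    """Euclidean distance of an LSTD solution to the asymptotic TD(1)
    solution, normalized by the norm of the latter (the error metric
    quoted in the Experiment Setup row)."""
    return np.linalg.norm(theta_lstd - theta_td1) / np.linalg.norm(theta_td1)

def lstd_lambda_run(features, rewards, rhos, gamma=0.9, lam=0.9, reg=1e-6):
    """Generic off-policy LSTD(lambda) with importance-sampling ratios
    `rhos` and constant `lam` -- a sketch, not the paper's scheme.

    features: (T+1, n) array of feature vectors phi_0 ... phi_T
    rewards:  (T,) array of rewards r_0 ... r_{T-1}
    rhos:     (T,) array of importance-sampling ratios
    """
    n = features.shape[1]
    A = np.zeros((n, n))
    b = np.zeros(n)
    e = np.zeros(n)  # eligibility trace
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        e = rhos[t] * (gamma * lam * e + phi)          # trace update
        A += np.outer(e, phi - gamma * phi_next)       # accumulate A
        b += e * rewards[t]                            # accumulate b
    # Solve A theta = b; small regularization for numerical stability
    return np.linalg.solve(A + reg * np.eye(n), b)
```

For example, with a single constant feature, constant reward 1, on-policy ratios of 1, and γ = 0.9, the solution approaches the true value 1/(1 − γ) = 10.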