Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Off-policy Learning With Eligibility Traces: A Survey
Authors: Matthieu Geist, Bruno Scherrer
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments suggest that the most standard algorithms, on- and off-policy LSTD(λ)/LSPE(λ) (and TD(λ) if the feature space dimension is too large for a least-squares approach), perform the best. ... This section aims at empirically comparing the surveyed algorithms. |
| Researcher Affiliation | Academia | Matthieu Geist EMAIL IMS-MaLIS Research Group & UMI 2958 (Georgia Tech-CNRS) Supélec 2 rue Edouard Belin 57070 Metz, France. Bruno Scherrer EMAIL MAIA project-team INRIA Lorraine 615 rue du Jardin Botanique 54600 Villers-lès-Nancy, France |
| Pseudocode | Yes | Algorithm 1: Off-policy LSTD(λ)... Algorithm 2: Off-policy LSPE(λ)... Algorithm 3: Off-policy FPKF(λ)... Algorithm 4: Off-policy BRM(λ)... Algorithm 5: Off-policy TD(λ)... Algorithm 6: Off-policy TDC(λ), also known as GQ(λ)... Algorithm 7: Off-policy GTD2(λ)... Algorithm 8: Off-policy gBRM(λ) |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | Yes | More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class of randomly constructed finite MDPs. |
| Dataset Splits | Yes | For each problem, we generate one trajectory of length 10^4 using the behavioral policy... Finally, for each case, for all problems and each algorithm, we choose the combination of meta-parameters which minimizes the average error on the last one-tenth of the averaged (over all problems) learning curves (we do this to reduce the sensitivity to the initialization and the transient behavior). |
| Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments. |
| Software Dependencies | No | The paper does not provide any specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For all algorithms, we choose θ0 = 0. For least-squares algorithms (LSTD, LSPE, FPKF and BRM), we set the initial matrices (M0, N0, C0) to 10^3 I... We use the following schedule for the learning rates: αi = α0 · αc / (αc + i) and βi = β0 · βc / (βc + i^(2/3)). ... For each meta-parameter, we consider the following ranges of values: λ ∈ {0, 0.4, 0.7, 0.9, 1}, α0 ∈ {10^-2, 10^-1, 10^0}, αc ∈ {10^1, 10^2, 10^3}, β0 ∈ {10^-2, 10^-1, 10^0} and βc ∈ {10^1, 10^2, 10^3}. |
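
The learning-rate schedules and meta-parameter ranges quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustrative reading of the quoted text, not code from the paper: the function and constant names are hypothetical, and the schedules are assumed to be αi = α0 · αc / (αc + i) and βi = β0 · βc / (βc + i^(2/3)), as reconstructed from the extracted quote.

```python
from itertools import product


def alpha(i, alpha0, alpha_c):
    """Primary learning rate at step i (assumed form: alpha0 * alpha_c / (alpha_c + i))."""
    return alpha0 * alpha_c / (alpha_c + i)


def beta(i, beta0, beta_c):
    """Secondary learning rate at step i (assumed form: beta0 * beta_c / (beta_c + i^(2/3)))."""
    return beta0 * beta_c / (beta_c + i ** (2.0 / 3.0))


# Ranges of values quoted in the Experiment Setup row.
LAMBDAS = [0.0, 0.4, 0.7, 0.9, 1.0]
ALPHA0S = [1e-2, 1e-1, 1e0]
ALPHA_CS = [1e1, 1e2, 1e3]
BETA0S = [1e-2, 1e-1, 1e0]
BETA_CS = [1e1, 1e2, 1e3]


def meta_parameter_grid():
    """All (lambda, alpha0, alpha_c, beta0, beta_c) combinations over the quoted ranges.

    Hypothetical helper: an algorithm that uses only some of these
    parameters would be searched over the corresponding sub-grid.
    """
    return list(product(LAMBDAS, ALPHA0S, ALPHA_CS, BETA0S, BETA_CS))
```

Under this reading, both rates start at their initial value (α0 or β0) at step 0 and decay from there, with βi decaying more slowly than αi because of the i^(2/3) term.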