Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards

Authors: Semih Cayci, Atilla Eryilmaz

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We corroborate our theoretical results with numerical experiments. ... In this section, we present numerical results for Robust TD learning and its non-robust counterpart.
Researcher Affiliation | Academia | Semih Cayci, Department of Mathematics, RWTH Aachen University, Aachen, Germany, cayci@mathc.rwth-aachen.de; Atilla Eryilmaz, Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, eryilmaz.2@osu.edu
Pseudocode | Yes | Algorithm 1: Robust TD learning
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | No | In the first example, we consider a randomly-generated MRP with |X| = 256. The transition kernel is randomly generated such that P(x, x′) ~ Unif(0, 1) i.i.d., and row-wise normalized to obtain a stochastic matrix. ... In this example, we consider a circular random walk for X = {1, 2, . . . , 256}...
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | In order to predict the value function, we use (projected) TD learning (see [4]) with linear function approximation based on Gaussian features of dimension d = 4 and projection radius ρ = 30. The discount factor is γ = 0.9, and the reward is Rt(Xt) = r(Xt) + Nt − E[Nt] with Nt ~ Pareto(1, 1.4) i.i.d. for any t. ... Mean squared error (2) under Robust TD learning and TD learning with the clipping radius bt = t and diminishing step-size ηt = 1/(λmin(1−γ)t) in Theorem 1 and projection radius ρ = 30 is shown in Figure 2.
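The randomly-generated MRP quoted in the Open Datasets row can be sketched in a few NumPy lines: draw i.i.d. Unif(0, 1) entries and normalize each row to obtain a stochastic transition matrix. The seed is an arbitrary choice, since the paper does not specify one:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary; the paper does not specify one
n = 256  # |X| = 256, as in the paper's first example

# Entries P(x, x') drawn i.i.d. from Unif(0, 1)
P = rng.uniform(0.0, 1.0, size=(n, n))

# Row-wise normalization makes each row a probability distribution
P /= P.sum(axis=1, keepdims=True)
```

Each row of `P` now sums to 1, so `P` is a valid transition kernel for the MRP.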
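The experiment-setup row describes projected TD learning with linear function approximation and a clipped (robustified) reward. The following is a minimal sketch of that loop under stated assumptions: the function name `robust_td`, the schedules `b_t = sqrt(t)` and `eta_t = 1/t`, and the small 16-state instance are illustrative placeholders, not the paper's exact choices from Theorem 1 (which uses ηt = 1/(λmin(1−γ)t)):

```python
import numpy as np

def robust_td(features, P, r_bar, noise, gamma=0.9, rho=30.0, T=5000, seed=0):
    """Sketch of Robust TD learning: linear function approximation,
    reward clipping at radius b_t, projection onto {||theta|| <= rho}.
    Schedules for b_t and eta_t are illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    theta = np.zeros(d)
    x = int(rng.integers(n))
    for t in range(1, T + 1):
        x_next = rng.choice(n, p=P[x])           # one transition of the MRP
        R = r_bar[x] + noise(rng)                # heavy-tailed observed reward
        b_t = np.sqrt(t)                         # clipping radius (illustrative)
        R_clip = np.clip(R, -b_t, b_t)           # robustification step
        eta_t = 1.0 / t                          # diminishing step size (illustrative)
        delta = R_clip + gamma * features[x_next] @ theta - features[x] @ theta
        theta += eta_t * delta * features[x]     # semi-gradient TD update
        norm = np.linalg.norm(theta)
        if norm > rho:                           # projection onto the rho-ball
            theta *= rho / norm
        x = x_next
    return theta

# Hypothetical small instance: 16 states, Gaussian features of dimension d = 4,
# centered Pareto(1, 1.4) noise (mean 1.4/0.4 = 3.5 subtracted, as in Rt = r + Nt - E[Nt])
rng = np.random.default_rng(1)
n, d = 16, 4
P = rng.uniform(size=(n, n))
P /= P.sum(axis=1, keepdims=True)
Phi = rng.normal(size=(n, d)) / np.sqrt(d)
r_bar = rng.uniform(size=n)
pareto_noise = lambda g: (g.pareto(1.4) + 1.0) - 3.5
theta = robust_td(Phi, P, r_bar, pareto_noise)
```

The projection keeps the iterate inside the radius-ρ ball, and the clipping bounds the otherwise heavy-tailed TD error, which is the mechanism the paper's robustness guarantees rest on.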