Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards
Authors: Semih Cayci, Atilla Eryilmaz
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We corroborate our theoretical results with numerical experiments. ... In this section, we present numerical results for Robust TD learning and its non-robust counterpart. |
| Researcher Affiliation | Academia | Semih Cayci Department of Mathematics RWTH Aachen University Aachen, Germany cayci@mathc.rwth-aachen.de Atilla Eryilmaz Department of Electrical and Computer Engineering The Ohio State University Columbus, OH 43210 eryilmaz.2@osu.edu |
| Pseudocode | Yes | Algorithm 1: Robust TD learning |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | No | In the first example, we consider a randomly-generated MRP with \|X\| = 256. The transition kernel is randomly generated such that P(x, x') ~ Unif(0, 1) i.i.d., and row-wise normalized to obtain a stochastic matrix. ... In this example, we consider a circular random walk for X = {1, 2, . . . , 256}... |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In order to predict the value function, we use (projected) TD learning (see [4]) with linear function approximation based on Gaussian features of dimension d = 4 and projection radius ρ = 30. The discount factor is γ = 0.9, and the reward is R_t(X_t) = r(X_t) + N_t − E[N_t] with N_t ~ Pareto(1, 1.4) i.i.d. for any t. ... Mean squared error (2) under Robust TD learning and TD learning with the clipping radius b_t = t and diminishing step-size η_t = 1/(λ_min(1 − γ)t) in Theorem 1 and projection radius ρ = 30 are shown in Figure 2. |
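The experiment setup reported above can be sketched in code. Since the paper releases no source, the following is an illustrative reconstruction, not the authors' implementation: it uses the reported values (|X| = 256 random MRP, d = 4 Gaussian features, γ = 0.9, ρ = 30, clipping radius b_t = t, step-size η_t = 1/(λ_min(1 − γ)t), centered Pareto(1, 1.4) noise), while the feature scaling, the base reward r(x), and the value of λ_min are assumptions made here for a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly generated MRP as described: |X| = 256 states, transition kernel
# P(x, x') ~ Unif(0, 1) i.i.d., row-wise normalized to a stochastic matrix.
n_states = 256
P = rng.uniform(size=(n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

d = 4          # Gaussian feature dimension (as reported)
gamma = 0.9    # discount factor (as reported)
rho = 30.0     # projection radius (as reported)
Phi = rng.normal(size=(n_states, d)) / np.sqrt(d)  # Gaussian features (assumed scaling)
r = rng.uniform(size=n_states)                     # base reward r(x) (assumed Unif(0, 1))

def pareto_noise(alpha=1.4, scale=1.0):
    """Heavy-tailed Pareto(scale, alpha) sample, centered by its mean E[N_t]."""
    n = scale * (1.0 + rng.pareto(alpha))
    mean = alpha * scale / (alpha - 1.0)  # finite since alpha = 1.4 > 1
    return n - mean

def robust_td(T=5000, lam_min=0.1):
    """Sketch of Robust TD: clip the TD error at radius b_t, then project theta.

    lam_min is an assumed stand-in for the smallest eigenvalue appearing
    in the reported step-size schedule.
    """
    theta = np.zeros(d)
    x = rng.integers(n_states)
    for t in range(1, T + 1):
        x_next = rng.choice(n_states, p=P[x])
        reward = r[x] + pareto_noise()  # R_t(X_t) = r(X_t) + N_t - E[N_t]
        delta = reward + gamma * Phi[x_next] @ theta - Phi[x] @ theta
        delta = np.clip(delta, -float(t), float(t))   # clipping radius b_t = t
        eta = 1.0 / (lam_min * (1.0 - gamma) * t)     # diminishing step-size
        theta += eta * delta * Phi[x]
        norm = np.linalg.norm(theta)
        if norm > rho:                                # project onto the rho-ball
            theta *= rho / norm
        x = x_next
    return theta
```

The clip-then-project structure is what distinguishes Robust TD from plain projected TD here: clipping the TD error bounds the influence of any single heavy-tailed reward sample, and the projection keeps the iterate in a bounded set as assumed by the analysis.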