Generalization and Exploration via Randomized Value Functions
Authors: Ian Osband, Benjamin Van Roy, Zheng Wen
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We will present computational results comparing RLSVI to LSVI with action-dithering schemes... The results demonstrate that RLSVI enjoys dramatic efficiency gains. Further, we establish a bound on the expected regret for an episodic tabula rasa learning context... |
| Researcher Affiliation | Collaboration | Ian Osband¹,² (IOSBAND@STANFORD.EDU), Benjamin Van Roy¹ (BVR@STANFORD.EDU), Zheng Wen¹,³ (ZWEN@ADOBE.COM); ¹Stanford University, ²Google DeepMind, ³Adobe Research |
| Pseudocode | Yes | Algorithm 1 (Randomized Least-Squares Value Iteration) and Algorithm 2 (RLSVI with greedy action); a hedged code sketch of the procedure follows the table. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes custom or simulated environments for its experiments (didactic chain environments, Tetris game, recommendation engine model) and does not provide concrete access information (links, DOIs, formal citations) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific details regarding train, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits) for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, or specific solver versions, needed to replicate the experiments. |
| Experiment Setup | Yes | Figure 2 presents the empirical regret for RLSVI with K = 10, N = 50, σ = 0.1, λ = 1 and an ε-greedy agent over 5 seeds. ... In Figure 7 we present learning curves for RLSVI (λ = 1, σ = 1) ... For our simulations we set β_a = 0 ∀a and sample a random problem instance by sampling γ_an ~ N(0, c²) independently for each a and n. ... We set N = 10, H = J = 5, c = 2 and L = 1200. ... The cumulative regret for both RLSVI (with λ = 0.2 and σ² = 10⁻³) and LSVI with Boltzmann exploration (with λ = 0.2 and a variety of temperature settings) are plotted in Figure 8. A hedged usage sketch with these settings follows the table. |
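
The pseudocode row above refers to Algorithm 1 (Randomized Least-Squares Value Iteration) and Algorithm 2 (greedy action selection). Below is a minimal Python sketch of that procedure as described in the paper: for each step of the horizon, working backwards, fit a Bayesian linear regression to backed-up value targets and draw a single posterior sample of the weights, then act greedily with respect to the sampled value functions. The function names, the `(s, a, r, s_next)` transition layout, and the zero-weight default for steps with no data are assumptions made for illustration; the authors released no code.

```python
import numpy as np

def rlsvi_sample_value_functions(data, phi, d, num_actions, horizon,
                                 sigma=0.1, lam=1.0, rng=np.random):
    """One pass of Randomized Least-Squares Value Iteration (cf. Algorithm 1).

    data[h]: list of transitions (s, a, r, s_next) observed at step h across
    all past episodes; phi(s, a): feature vector of length d.
    Returns theta_tilde, a list of sampled weight vectors, one per step h.
    """
    theta_tilde = [np.zeros(d) for _ in range(horizon)]

    # Work backwards through the horizon, as in value iteration.
    for h in reversed(range(horizon)):
        transitions = data[h]
        if not transitions:
            continue  # no data at this step yet; keep the zero default
        Phi = np.stack([phi(s, a) for (s, a, _, _) in transitions])

        # Targets: reward plus the max of the step-(h+1) sampled value
        # function at the next state (just the reward at the final step).
        def target(r, s_next):
            if h == horizon - 1:
                return r
            return r + max(theta_tilde[h + 1] @ phi(s_next, a2)
                           for a2 in range(num_actions))

        y = np.array([target(r, s_next) for (_, _, r, s_next) in transitions])

        # Bayesian linear regression posterior with noise variance sigma^2
        # and Gaussian prior precision lam, followed by one posterior sample.
        cov = np.linalg.inv(Phi.T @ Phi / sigma**2 + lam * np.eye(d))
        mean = cov @ Phi.T @ y / sigma**2
        theta_tilde[h] = rng.multivariate_normal(mean, cov)
    return theta_tilde


def greedy_action(theta_tilde, phi, h, s, num_actions):
    """Act greedily w.r.t. the sampled value function at step h (cf. Algorithm 2)."""
    return max(range(num_actions), key=lambda a: theta_tilde[h] @ phi(s, a))
```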
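
The experiment-setup row quotes σ = 0.1 and λ = 1 for the chain experiments; the snippet below shows how those two values would plug into the sketch above. `chain_data`, `chain_phi`, `FEATURE_DIM`, `NUM_ACTIONS`, `HORIZON`, and `initial_state` are placeholders, not the paper's actual chain construction.

```python
# Hypothetical wiring of the quoted noise/prior settings (sigma = 0.1, lam = 1)
# into the sketch above; every identifier here is a stand-in for the paper's
# chain experiment rather than its real configuration.
theta_tilde = rlsvi_sample_value_functions(chain_data, chain_phi, d=FEATURE_DIM,
                                           num_actions=NUM_ACTIONS,
                                           horizon=HORIZON, sigma=0.1, lam=1.0)
action = greedy_action(theta_tilde, chain_phi, h=0, s=initial_state,
                       num_actions=NUM_ACTIONS)
```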