Generalization and Exploration via Randomized Value Functions

Authors: Ian Osband, Benjamin Van Roy, Zheng Wen

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We will present computational results comparing RLSVI to LSVI with action-dithering schemes... The results demonstrate that RLSVI enjoys dramatic efficiency gains. Further, we establish a bound on the expected regret for an episodic tabula rasa learning context...
Researcher Affiliation | Collaboration | Ian Osband (1,2) IOSBAND@STANFORD.EDU, Benjamin Van Roy (1) BVR@STANFORD.EDU, Zheng Wen (1,3) ZWEN@ADOBE.COM; 1 Stanford University, 2 Google Deepmind, 3 Adobe Research
Pseudocode | Yes | Algorithm 1 (Randomized Least-Squares Value Iteration) and Algorithm 2 (RLSVI with greedy action). A minimal sketch of these algorithms follows this table.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology.
Open Datasets | No | The paper describes custom or simulated environments for its experiments (didactic chain environments, the game of Tetris, a recommendation engine model) and does not provide concrete access information (links, DOIs, formal citations) for a publicly available or open dataset.
Dataset Splits | No | The paper does not provide specific details regarding train, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits) for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers or specific solver versions, needed to replicate the experiments.
Experiment Setup | Yes | Figure 2 presents the empirical regret for RLSVI with K=10, N=50, σ=0.1, λ=1 and an ε-greedy agent over 5 seeds. ... In Figure 7 we present learning curves for RLSVI with λ=1, σ=1 ... For our simulations we set βa = 0 ∀a and sample a random problem instance by sampling γan ∼ N(0, c²) independently for each a and n. ... We set N = 10, H = J = 5, c = 2 and L = 1200. ... The cumulative regret for both RLSVI (with λ = 0.2 and σ² = 10⁻³) and LSVI with Boltzmann exploration (with λ = 0.2 and a variety of temperature settings) are plotted in Figure 8. A sketch of the quoted problem-instance sampling also follows this table.
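
For orientation, below is a minimal NumPy sketch of the randomized least-squares value iteration update named in Algorithm 1, followed by the greedy action rule of Algorithm 2. The feature map phi, the buffer layout, and the treatment of the regularizer lam as a Gaussian prior precision are assumptions made for illustration; the paper does not release an implementation.

```python
# Minimal sketch of RLSVI (Algorithm 1) and greedy action selection (Algorithm 2).
# Assumptions for illustration: phi(s, a) returns a d-dimensional feature vector,
# buffer[h] holds (s, a, r, s_next) tuples observed at timestep h, and lam is
# treated as a Gaussian prior precision.
import numpy as np


def rlsvi_sample_weights(buffer, phi, d, num_actions, H, lam=1.0, sigma=0.1, rng=None):
    """Sample one randomized value-weight vector per timestep h = H-1, ..., 0."""
    rng = np.random.default_rng() if rng is None else rng
    theta = [np.zeros(d) for _ in range(H + 1)]  # values beyond the horizon are zero

    for h in reversed(range(H)):
        if len(buffer[h]) == 0:
            # No data yet at this timestep: sample from the prior.
            theta[h] = rng.multivariate_normal(np.zeros(d), np.eye(d) / lam)
            continue

        X, y = [], []
        for (s, a, r, s_next) in buffer[h]:
            X.append(phi(s, a))
            # Regression target: observed reward plus the max next-step value under
            # the weights already sampled for timestep h + 1 (theta[H] = 0 handles
            # the terminal step).
            q_next = max(phi(s_next, b) @ theta[h + 1] for b in range(num_actions))
            y.append(r + q_next)
        X, y = np.asarray(X), np.asarray(y)

        # Bayesian linear regression posterior with observation noise variance sigma^2.
        precision = X.T @ X / sigma**2 + lam * np.eye(d)
        cov = np.linalg.inv(precision)
        mean = cov @ X.T @ y / sigma**2

        # The defining RLSVI step: sample the value weights instead of using the mean.
        theta[h] = rng.multivariate_normal(mean, cov)
    return theta


def greedy_action(theta_h, phi, s, num_actions):
    """Act greedily with respect to the sampled value function (Algorithm 2)."""
    return int(np.argmax([phi(s, a) @ theta_h for a in range(num_actions)]))
```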
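
A small sketch of the random problem-instance sampling quoted in the Experiment Setup row (βa = 0 ∀a, γan ∼ N(0, c²), with N = 10 and c = 2) is given below. The array shapes and the num_actions argument are illustrative assumptions; the paper does not publish an implementation.

```python
# Sketch of the quoted problem-instance sampling: beta_a = 0 for every action a,
# and gamma_{a,n} drawn independently from N(0, c^2). The number of actions and
# the (num_actions, N) shape are assumptions made for illustration.
import numpy as np


def sample_problem_instance(num_actions, N=10, c=2.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    beta = np.zeros(num_actions)                       # beta_a = 0 for all a
    gamma = rng.normal(0.0, c, size=(num_actions, N))  # gamma_{a,n} ~ N(0, c^2)
    return beta, gamma
```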