Limiting Extrapolation in Linear Approximate Value Iteration

Authors: Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | “In our simulations we show that small levels of amplification can be achieved, and that our algorithm can effectively mitigate the divergence observed in some simple MDPs for least-squares AVI. This happens even when using identical feature representations, highlighting the benefit of bounding extrapolation through constructing feature representations as near convex combinations (versus ℓ2 or other common loss functions). Furthermore, we empirically show that small amplification factors can be obtained with relatively small sets of anchor points.” [Section 5, Numerical Simulations] “We investigate the potential benefit of LAVIER over least-squares AVI (LS-AVI). [...] The empirical results are obtained by averaging 100 simulations and they are reported with 95%-confidence intervals.”
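The “amplification” the response refers to can be illustrated with a small, self-contained sketch (this is not the paper's LAVIER procedure, and the states, features, and values below are invented for illustration): a least-squares fit expresses its prediction as a fixed linear combination of the values observed at anchor states, and outside the sampled region the absolute values of those weights can sum well above 1, whereas any convex combination of anchor values has weight ℓ1-norm exactly 1 and can never leave the range of the observed values.

```python
import numpy as np

# Illustrative sketch (not the paper's LAVIER algorithm): three anchor states
# s = 0, 1, 2 with affine features [1, s], and values known only at the anchors.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # feature matrix at the anchor states
v = np.array([0.0, 1.0, 0.0])       # values observed at the anchors

# A least-squares prediction at a query state s is a fixed linear combination
# of the anchor values: V(s) = phi(s)^T pinv(X) v = w(s)^T v.
phi = np.array([1.0, 4.0])          # query state s = 4, outside the anchors
w = phi @ np.linalg.pinv(X)         # weights over the anchor values
amplification = np.abs(w).sum()     # L1 norm of the weights

# The weights sum to 1 (affine fit) but are not all nonnegative, so their
# L1 norm exceeds 1; a convex combination would have L1 norm exactly 1.
print(f"least-squares weight L1 norm at s=4: {amplification:.2f}")
```

Here the L1 norm is about 3.33, so an error of size ε at the anchors can be amplified to roughly 3.33·ε at the query state; constraining the weights to a (near) convex combination caps this factor at (close to) 1, which is the behavior the quoted passage describes.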
Researcher Affiliation | Collaboration | Andrea Zanette (Institute for Computational and Mathematical Engineering, Stanford University, CA; zanette@stanford.edu); Alessandro Lazaric (Facebook AI Research; lazaric@fb.com); Mykel J. Kochenderfer (Department of Aeronautics and Astronautics, Stanford University, CA; mykel@stanford.edu); Emma Brunskill (Department of Computer Science, Stanford University, CA; ebrun@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1, “LAVIER algorithm.”
Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code.
Open Datasets | No | The paper describes different MDP scenarios for its simulations (“Two-state MDP of Tsitsiklis and Van Roy”, “Chain MDP”, “Successive Linear Bandits”) but does not provide any links, DOIs, or formal citations for public datasets used in the experiments. These appear to be custom-defined simulation environments.
Dataset Splits | No | The paper mentions generating samples and running simulations (e.g., “1000 samples at each timestep”, “The samples are generated uniformly from the left and middle node”), but it does not specify explicit train/validation/test dataset splits for reproduction.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory specifications) used to conduct the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup | Yes | “For simplicity, we set the parameter = 0.01, and add a zero-mean noise to all rewards generated as 1/2 − Ber(1/2), where Ber(·) is a Bernoulli random variable. [...] The length of the chain is N = 50, which is also the time horizon. [...] At each state s_1, ..., s_N, we represent actions in R^2 and we generate 100 actions by uniformly discretizing the circumference. [...] The anchor points for LAVIER are chosen by our adaptive procedure for different values of the extrapolation coefficient C ∈ {1.05, 1.2, 1.5}.”
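Two pieces of the quoted setup are mechanical enough to sketch directly: the zero-mean reward noise 1/2 − Ber(1/2), and the 100 actions obtained by uniformly discretizing the circumference in R^2. The sketch below is an assumed reading of those two sentences (variable names are my own; the chain dynamics and the adaptive anchor-point procedure are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean reward noise: 1/2 - Ber(1/2) takes the values +1/2 and -1/2
# with equal probability, so its expectation is 0.
noise = 0.5 - rng.binomial(1, 0.5, size=1000)

# 100 actions represented in R^2 by uniformly discretizing the circumference;
# endpoint=False avoids duplicating the angle 0 == 2*pi.
angles = np.linspace(0.0, 2.0 * np.pi, num=100, endpoint=False)
actions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (100, 2)
```

Each action is a unit vector, so the 100 actions are evenly spaced points on the unit circle, matching “uniformly discretizing the circumference.”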