Limiting Extrapolation in Linear Approximate Value Iteration
Authors: Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our simulations we show that small levels of amplification can be achieved, and that our algorithm can effectively mitigate the divergence observed in some simple MDPs for least-squares AVI. This happens even when using identical feature representations, highlighting the benefit of bounding extrapolation through constructing feature representations as near convex combinations (versus ℓ2 or other common loss functions). Furthermore, we empirically show that small amplification factors can be obtained with relatively small sets of anchor points. 5 Numerical Simulations We investigate the potential benefit of LAVIER over least-squares AVI (LS-AVI). [...] The empirical results are obtained by averaging 100 simulations and they are reported with 95%-confidence intervals. |
| Researcher Affiliation | Collaboration | Andrea Zanette Institute for Computational and Mathematical Engineering, Stanford University, CA zanette@stanford.edu Alessandro Lazaric Facebook AI Research lazaric@fb.com Mykel J. Kochenderfer Department of Aeronautics and Astronautics, Stanford University, CA mykel@stanford.edu Emma Brunskill Department of Computer Science, Stanford University, CA ebrun@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1 LAVIER algorithm. |
| Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code. |
| Open Datasets | No | The paper describes different MDP scenarios for its simulations (“Two-state MDP of Tsitsiklis and Van Roy”, “Chain MDP”, “Successive Linear Bandits”) but does not provide any links, DOIs, or formal citations for public datasets used in the experiments. These appear to be custom-defined simulation environments. |
| Dataset Splits | No | The paper mentions generating samples and running simulations (e.g., “1000 samples at each timestep”, “The samples are generated uniformly from the left and middle node”), but it does not specify explicit train/validation/test dataset splits for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory specifications) used to conduct the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers). |
| Experiment Setup | Yes | For simplicity, we set the parameter = 0.01, and add a zero-mean noise to all rewards generated as 1/2 Ber(1/2), where Ber(·) is a Bernoulli random variable. [...] The length of the chain is N = 50, which is also the time horizon. [...] At each state s1, . . . , s N, we represent actions in R2 and we generate 100 actions by uniformly discretizing the circumference. [...] The anchor points for LAVIER are chosen by our adaptive procedure for different values of the extrapolation coefficient C ∈ {1.05, 1.2, 1.5}. |
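The paper reports results averaged over 100 simulations with 95%-confidence intervals but releases no code, so the reporting protocol must be reconstructed. A minimal sketch is below; the normal-approximation interval (1.96 standard errors) is an assumption, since the paper does not state how its intervals were computed.

```python
import math
import statistics

def mean_ci95(samples):
    """Mean and half-width of a 95% confidence interval over simulation runs.

    Uses the normal approximation (1.96 * standard error); this is an
    assumption, as the paper does not specify its interval construction.
    """
    m = statistics.mean(samples)
    se = statistics.stdev(samples) / math.sqrt(len(samples))
    return m, 1.96 * se

# Example: 100 simulated returns, as in the paper's protocol.
runs = [1.0] * 50 + [3.0] * 50
mean, half_width = mean_ci95(runs)
```

A reproduction would report `mean ± half_width` for each configuration of the extrapolation coefficient.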
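Two concrete pieces of the quoted setup can be sketched directly: the 100 actions in R² obtained by uniformly discretizing the circle, and the zero-mean Bernoulli reward noise. Reading "1/2 Ber(1/2)" as noise of ±1/2 with equal probability is an assumption made here for illustration; the function names are hypothetical.

```python
import math
import random

def make_actions(n_actions=100):
    """100 actions in R^2: uniformly discretize the unit circle."""
    return [(math.cos(2 * math.pi * k / n_actions),
             math.sin(2 * math.pi * k / n_actions))
            for k in range(n_actions)]

def reward_noise(rng=random):
    """Zero-mean noise: +1/2 or -1/2 with equal probability.

    One plausible reading of the paper's '1/2 Ber(1/2)'; the exact
    construction is not fully legible in the extracted text.
    """
    return 0.5 if rng.random() < 0.5 else -0.5
```

Each action lies on the unit circle, so the action set is bounded, consistent with the chain-MDP setup of horizon N = 50 described above.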