Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
Authors: Ben Eysenbach, Russ R. Salakhutdinov, Sergey Levine
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare SoRB to prior methods on two tasks: a simple 2D environment, and then a visual navigation task, where our method will plan over images. Ablation experiments will illustrate that accurate distance estimates are crucial to our algorithm's success. |
| Researcher Affiliation | Collaboration | Benjamin Eysenbach (CMU, Google Brain), Ruslan Salakhutdinov (CMU), Sergey Levine (Google Brain, UC Berkeley); beysenba@cs.cmu.edu |
| Pseudocode | Yes | Algorithm 1: Inputs are the current state s, the goal state s_g, a buffer of observations B, the learned policy π, and its value function V. Returns an action a. function SEARCHPOLICY(s, s_g, B, V, π) (a sketch of this procedure appears after the table) |
| Open Source Code | No | The paper provides a link to a browser-based demo, 'http://bit.ly/rl_search', but not to the source code of the described methodology. |
| Open Datasets | Yes | We use 3D houses from the SUNCG dataset (Song et al., 2017), similar to the task described by Shah et al. (2018). |
| Dataset Splits | No | The paper describes training on 100 SUNCG houses and evaluating on 22 held-out houses, but does not provide specific train/validation/test splits (e.g., percentages or sample counts) for a single dataset. |
| Hardware Specification | No | The paper does not specify any particular GPU or CPU models, or other hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions software like DQN, DDPG, and C51, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For VIN, we tuned the number of iterations as well as the number of hidden units in the recurrent layer. For SPTM, we performed a grid search over the threshold for adding edges, the threshold for choosing the next waypoint along the shortest path, and the parameters for sampling the training data. |
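
The quoted Algorithm 1 lists only the inputs and the function signature. For orientation, here is a minimal Python sketch of how such a search policy could work, assuming a goal-conditioned value function `V(s, g)` that approximates the negative number of steps from `s` to `g` and a goal-conditioned policy `pi(s, g)`; the helper names, the `networkx` dependency, and the `max_dist` edge-pruning threshold are illustrative assumptions, not the authors' exact implementation.

```python
import networkx as nx  # assumed here for the shortest-path search; any graph library works


def search_policy(s, s_g, buffer_states, V, pi, max_dist=7.0):
    """Hedged sketch of SEARCHPOLICY: plan a path over replay-buffer states
    using distances derived from the value function, then act toward the
    first waypoint with the goal-conditioned policy."""
    # Nodes: current state, every state in the buffer, and the goal state.
    nodes = [s] + list(buffer_states) + [s_g]

    graph = nx.DiGraph()
    for i, s_i in enumerate(nodes):
        for j, s_j in enumerate(nodes):
            if i == j:
                continue
            # With a -1-per-step reward, the value function gives a distance
            # estimate: dist(s_i, s_j) ~= -V(s_i, s_j).
            dist = -V(s_i, s_j)
            # Keep only edges the low-level policy can plausibly traverse.
            if dist < max_dist:
                graph.add_edge(i, j, weight=dist)

    start, goal = 0, len(nodes) - 1
    try:
        path = nx.shortest_path(graph, start, goal, weight="weight")
    except nx.NetworkXNoPath:
        # No path through the buffer: fall back to the raw goal-conditioned policy.
        return pi(s, s_g)

    # Act toward the first waypoint after the current state (the goal itself
    # if the path is direct).
    waypoint = nodes[path[1]]
    return pi(s, waypoint)
```

The paper additionally discusses making the distance estimates robust (e.g., via the distributional critics mentioned in the Software Dependencies row), and the ablations quoted above indicate that accuracy of those estimates is crucial; such refinements are omitted from this sketch for readability.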