Operator Splitting Value Iteration
Authors: Amin Rakhsha, Andrew Wang, Mohammad Ghavamzadeh, Amir-massoud Farahmand
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate both OS-VI and OS-Dyna in a finite MDP and compare them with existing methods. Here we present the results for the Control problem on a modified cliffwalk environment in a 6×6 grid with 4 actions (UP, DOWN, LEFT, RIGHT). The left plot in Figure 2 shows the convergence of OS-VI compared to VI and the solutions the model itself would lead to; it plots the normalized error ‖V_k − V*‖ w.r.t. ‖V*‖. (Right) Comparison of OS-Dyna with Dyna and Q-Learning in the RL setting. (A minimal value-iteration sketch illustrating this error metric appears after the table.) |
| Researcher Affiliation | Collaboration | Amin Rakhsha (1,2), Andrew Wang (1,2), Mohammad Ghavamzadeh (3), Amir-massoud Farahmand (2,1); affiliations: 1 Department of Computer Science, University of Toronto; 2 Vector Institute; 3 Google Research |
| Pseudocode | Yes | Algorithm 1 OS-Dyna |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The details are in the supplementary material. |
| Open Datasets | No | The paper mentions a 'modified cliffwalk environment in a 6x6 grid' but does not provide a link or citation to a public dataset, nor does it explicitly state its public availability. |
| Dataset Splits | No | The paper describes an RL setup where 'algorithms are given a sample (X_t, A_t, R_t, X'_t)' but does not specify traditional training, validation, or test dataset splits. |
| Hardware Specification | No | The experiments are simple and can be run on a personal computer. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers. |
| Experiment Setup | Yes | The learning rates are a constant α for iterations t ≤ N and then decay in the form of α_t = α/(t − N) afterwards. We have fine-tuned the learning rate schedule for each algorithm separately for the best results. (See this schedule implemented in the Q-learning sketch after the table.) |
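
The Research Type row quotes the paper's cliffwalk experiment and its normalized-error metric. As a rough illustration of that metric, here is a minimal sketch of plain value iteration (the VI baseline, not the paper's OS-VI) on a toy 6×6 cliffwalk-style grid. The grid layout, cliff placement, rewards, goal location, and discount factor below are all assumptions for illustration, not the paper's exact environment.

```python
import numpy as np

# Hypothetical toy setup: a 6x6 cliffwalk-style grid with 4 actions.
# Layout, rewards, and gamma are assumptions, not the paper's environment.
N, GAMMA = 6, 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT
GOAL = N * N - 1  # assumed goal: bottom-right corner, absorbing

def step(s, a):
    """Deterministic transition: move if in bounds, otherwise stay put."""
    r, c = divmod(s, N)
    dr, dc = ACTIONS[a]
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return nr * N + nc

def reward(s2):
    """Assumed rewards: -100 for landing on the bottom-row 'cliff',
    -1 per ordinary step."""
    if s2 != GOAL and s2 // N == N - 1:
        return -100.0
    return -1.0

def bellman_optimality(V):
    """One sweep of the Bellman optimality operator T*."""
    V_new = np.empty_like(V)
    for s in range(N * N):
        if s == GOAL:
            V_new[s] = 0.0  # goal is absorbing with value 0
            continue
        best = -np.inf
        for a in range(4):
            s2 = step(s, a)
            best = max(best, reward(s2) + GAMMA * V[s2])
        V_new[s] = best
    return V_new

# Run VI to near-convergence to get a reference V*, then replay VI and
# record the normalized error ||V_k - V*|| / ||V*|| per iteration,
# the quantity plotted in the paper's left panel of Figure 2.
V_star = np.zeros(N * N)
for _ in range(500):
    V_star = bellman_optimality(V_star)

V = np.zeros(N * N)
for k in range(30):
    err = np.linalg.norm(V - V_star, np.inf) / np.linalg.norm(V_star, np.inf)
    print(f"iter {k:2d}  normalized error {err:.4f}")
    V = bellman_optimality(V)
```

Plain VI contracts at rate γ per sweep, which is the baseline convergence curve the paper's OS-VI is compared against.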
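The right panel compares OS-Dyna against Dyna and Q-Learning, where algorithms receive samples (X_t, A_t, R_t, X'_t). Below is a sketch of the standard tabular Q-learning baseline combined with the learning-rate schedule quoted in the Experiment Setup row. This is the Q-Learning baseline, not the paper's OS-Dyna; the values of α, N, γ, and the state/action sizes are placeholder assumptions (the paper tunes the schedule per algorithm).

```python
import numpy as np

def learning_rate(t, alpha, N):
    """Schedule from the Experiment Setup row: constant alpha for
    t <= N, then decaying as alpha / (t - N). alpha and N here are
    placeholders, not the paper's tuned values."""
    return alpha if t <= N else alpha / (t - N)

def q_learning_update(Q, sample, t, gamma=0.9, alpha=0.5, N=1000):
    """One tabular Q-learning update from a single sample
    (X_t, A_t, R_t, X'_t)."""
    x, a, r, x_next = sample
    target = r + gamma * np.max(Q[x_next])
    Q[x, a] += learning_rate(t, alpha, N) * (target - Q[x, a])
    return Q

# Usage with assumed sizes: 36 states (6x6 grid), 4 actions.
Q = np.zeros((36, 4))
Q = q_learning_update(Q, sample=(0, 3, -1.0, 1), t=1)
```

Note that the decayed rate is continuous at the switch point: at t = N + 1 it equals α, then shrinks as 1/(t − N), which is consistent with the "afterwards" phrasing in the quoted setup.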