Operator Splitting Value Iteration

Authors: Amin Rakhsha, Andrew Wang, Mohammad Ghavamzadeh, Amir-massoud Farahmand

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate both OS-VI and OS-Dyna in a finite MDP and compare them with existing methods. Here we present the results for the Control problem on a modified cliffwalk environment, a 6x6 grid with 4 actions (UP, DOWN, LEFT, RIGHT). The left plot of Figure 2 shows the convergence of OS-VI compared to VI and to the solutions the model itself would lead to; it plots the normalized error of V_k - V* w.r.t. V*. The right plot compares OS-Dyna with Dyna and Q-Learning in the RL setting. See the OS-VI sketch after this table.
Researcher Affiliation | Collaboration | Amin Rakhsha (1,2), Andrew Wang (1,2), Mohammad Ghavamzadeh (3), Amir-massoud Farahmand (2,1); 1: Department of Computer Science, University of Toronto; 2: Vector Institute; 3: Google Research
Pseudocode | Yes | Algorithm 1: OS-Dyna. See the Dyna-style sketch after this table.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The details are in the supplementary material.
Open Datasets | No | The paper mentions a 'modified cliffwalk environment in a 6x6 grid' but does not provide a link or citation to a public dataset, nor does it explicitly state its public availability.
Dataset Splits | No | The paper describes an RL setup where 'algorithms are given a sample (X_t, A_t, R_t, X'_t)' but does not specify traditional training, validation, or test dataset splits.
Hardware Specification | No | The experiments are simple and can be run on a personal computer.
Software Dependencies | No | The paper does not provide specific software names with version numbers.
Experiment Setup | Yes | The learning rates are constant α for iterations t ≤ N and then decay in the form of α_t = α/(t − N) afterwards. We have fine-tuned the learning rate schedule for each algorithm separately for the best results. See the schedule helper after this table.
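
The OS-VI iteration referenced in the Research Type row is not reproduced in this summary. Below is a minimal sketch, assuming the policy-evaluation form of the OS-VI update, V_{k+1} = (I − γP̂)^{-1}(r + γ(P − P̂)V_k), where P̂ is an approximate model of the true dynamics P; the matrix sizes and iteration count are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def os_vi_policy_evaluation(P, P_hat, r, gamma, n_iters=100):
    """Sketch of the OS-VI policy-evaluation update (illustrative, not the paper's code).

    P      : (n, n) true transition matrix under the evaluated policy.
    P_hat  : (n, n) approximate model of P.
    r      : (n,) reward vector.
    gamma  : discount factor in [0, 1).

    Update: V_{k+1} = (I - gamma * P_hat)^{-1} (r + gamma * (P - P_hat) @ V_k).
    With P_hat = 0 this reduces to standard value iteration; with P_hat = P it
    solves the MDP in a single application.
    """
    n = len(r)
    A = np.eye(n) - gamma * P_hat          # planning is done with the approximate model
    V = np.zeros(n)
    for _ in range(n_iters):
        # The true dynamics enter only through the correction term (P - P_hat) @ V.
        V = np.linalg.solve(A, r + gamma * (P - P_hat) @ V)
    return V
```

The fixed point of this update satisfies V = r + γPV, i.e., the true value function, which is why a good approximate model accelerates convergence without biasing the solution.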
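
Algorithm 1 (OS-Dyna) itself is only referenced above, not reproduced. As a rough, hypothetical illustration of the Dyna-style loop it builds on (update an empirical model from each sample (X_t, A_t, R_t, X'_t), then plan with the current model), here is a tabular sketch; the environment interface, the count-based model, and the planning routine are all assumptions, not the paper's algorithm.

```python
import numpy as np

def dyna_style_loop(env, n_states, n_actions, gamma=0.99, steps=10_000):
    """Hypothetical tabular Dyna-style loop (not the paper's Algorithm 1).

    Each sample (X_t, A_t, R_t, X'_t) updates an empirical model; planning on
    the learned model (a few Bellman sweeps) produces the current Q-function.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward).
    """
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts
    r_sum = np.zeros((n_states, n_actions))              # accumulated rewards
    Q = np.zeros((n_states, n_actions))
    x = env.reset()
    for t in range(steps):
        # Epsilon-greedy behavior policy.
        a = np.random.randint(n_actions) if np.random.rand() < 0.1 else int(Q[x].argmax())
        x_next, reward = env.step(a)
        counts[x, a, x_next] += 1
        r_sum[x, a] += reward
        # Plan with the empirical model (done every step here for brevity).
        n_sa = counts.sum(axis=2, keepdims=True).clip(min=1)
        P_hat = counts / n_sa
        r_hat = r_sum / n_sa.squeeze(-1)
        for _ in range(5):
            Q = r_hat + gamma * P_hat @ Q.max(axis=1)
        x = x_next
    return Q
```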
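
The learning-rate schedule in the Experiment Setup row is concrete enough to state in code. A small helper, assuming alpha and N stand for the per-algorithm tuned constants mentioned above:

```python
def learning_rate(t, alpha, N):
    """Schedule quoted in the paper's experiment setup:
    constant alpha for t <= N, then alpha / (t - N) afterwards."""
    return alpha if t <= N else alpha / (t - N)
```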