Sketched Newton Value Iteration for Large-Scale Markov Decision Processes
Authors: Jinsong Liu, Chenghan Xie, Qi Deng, Dongdong Ge, Yinyu Ye
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superiority of our algorithms over traditional VI and previously proposed second-order VI algorithms. To validate the effectiveness of our proposed algorithms, we conduct extensive numerical experiments on various MDP instances. |
| Researcher Affiliation | Academia | 1Shanghai University of Finance and Economics 2Fudan University 3Stanford University liujinsong@163.sufe.edu.cn, 20307130043@fudan.edu.cn, qideng@sufe.edu.cn, ge.dongdong@mail.shufe.edu.cn, yyye@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Calculation of the Jacobian Matrix; Algorithm 2: Sketched Newton Value Iteration |
| Open Source Code | No | The paper does not provide any statement about releasing the source code for the NVI or SNVI methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | We use two structured (forest and machine replacement) and some random (Garnet) MDP instances as benchmark problems. The forest MDP is influenced by the application of dynamic programming in optimizing fire management strategies (Possingham and Tuck 1997). Machine replacement is another widely used MDP representative instance (Delage and Mannor 2010; Wiesemann, Kuhn, and Rustem 2013; Clement and Kroer 2021). And Generalized Average Reward Non-stationary Environment Test-bench (Garnet MDPs) (Archibald, Mc Kinnon, and Thomas 1995) offers a representative set of abstract MDPs that serve as a common benchmark. |
| Dataset Splits | No | The paper describes the MDP instances used but does not specify how these instances or any datasets were split into training, validation, or test sets. It defines a stopping criterion based on error relative to the optimal value function, but this does not describe data partitioning. |
| Hardware Specification | Yes | All experiments are conducted in Python on a Mac OS laptop equipped with a 3.2 GHz 8-Core Apple M1 processor. |
| Software Dependencies | No | The paper mentions 'Python' and 'Numpy' (citing its foundational paper) and refers to 'automatic differentiation techniques' (citing Paszke et al., often associated with PyTorch) and 'MDP toolbox framework', but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The stopping criterion for all the algorithms is based on achieving a difference between the current value function and the optimal value function within a threshold of ϵ(1 − γ), given by the inequality e_t = ‖v_t − v*‖ ≤ ϵ(1 − γ). Throughout our experiments, we set ϵ to 0.1 and initialize all algorithms with v_0 = 0. In order to effectively portray the results, we utilize logarithmic scaling for both the vertical axis representing the error and the horizontal axis representing the running time in seconds. Notably, for the purpose of preserving numerical stability, we gradually increase the softmax parameter β. We apply the sketching matrix S_t ∈ {0, 1}^(d×s) as described in (Gower et al. 2019), which contains precisely one non-zero entry per row and column. |
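The setup row above describes two concrete mechanics: a {0, 1}-valued sub-sampling sketch S_t with one non-zero per column, and a stopping test ‖v_t − v*‖ ≤ ϵ(1 − γ). A minimal NumPy sketch of both is below; the dimensions `d` and `s`, the discount factor `gamma`, and the placeholder value vectors are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative dimensions (not from the paper): d states, sketch size s.
d, s = 8, 3
rng = np.random.default_rng(0)

# Sub-sampling sketch S_t in {0,1}^(d x s), in the style of Gower et al. 2019:
# each column selects one distinct coordinate of R^d, so every column has
# exactly one non-zero entry and every row has at most one.
idx = rng.choice(d, size=s, replace=False)
S = np.zeros((d, s))
S[idx, np.arange(s)] = 1.0

# Stopping criterion from the paper: halt once
#   e_t = ||v_t - v*|| <= eps * (1 - gamma), with eps = 0.1 and v_0 = 0.
eps, gamma = 0.1, 0.9                      # gamma is an assumed discount factor
v_star = rng.standard_normal(d)            # placeholder optimal value function
v_t = v_star + 1e-3                        # placeholder current iterate
converged = np.linalg.norm(v_t - v_star, np.inf) <= eps * (1 - gamma)
```

Multiplying a d-dimensional residual by such an `S` simply extracts `s` of its coordinates, which is what makes the sketched Newton step cheap relative to a full Jacobian solve.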