Sketched Newton Value Iteration for Large-Scale Markov Decision Processes

Authors: Jinsong Liu, Chenghan Xie, Qi Deng, Dongdong Ge, Yinyu Ye

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority of our algorithms over traditional VI and previously proposed second-order VI algorithms. To validate the effectiveness of our proposed algorithms, we conduct extensive numerical experiments on various MDP instances.
Researcher Affiliation | Academia | ¹Shanghai University of Finance and Economics, ²Fudan University, ³Stanford University. liujinsong@163.sufe.edu.cn, 20307130043@fudan.edu.cn, qideng@sufe.edu.cn, ge.dongdong@mail.shufe.edu.cn, yyye@stanford.edu
Pseudocode | Yes | Algorithm 1: Calculation of the Jacobian Matrix; Algorithm 2: Sketched Newton Value Iteration
Open Source Code | No | The paper does not provide any statement about releasing the source code for the NVI or SNVI methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We use two structured (forest and machine replacement) and some random (Garnet) MDP instances as benchmark problems. The forest MDP is influenced by the application of dynamic programming in optimizing fire management strategies (Possingham and Tuck 1997). Machine replacement is another widely used representative MDP instance (Delage and Mannor 2010; Wiesemann, Kuhn, and Rustem 2013; Clement and Kroer 2021). The Generalized Average Reward Non-stationary Environment Test-bench (Garnet MDPs) (Archibald, McKinnon, and Thomas 1995) offers a representative set of abstract MDPs that serve as a common benchmark.
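The paper does not release code for these benchmarks, but a Garnet-style random MDP and the classical VI baseline it is compared against can be sketched in NumPy. This is a minimal illustration under standard assumptions (branching-factor transitions with Dirichlet weights, rewards in [0, 1)); the exact Garnet parameters used in the paper are not specified here.

```python
import numpy as np

def garnet_mdp(n_states, n_actions, branching, rng):
    # Garnet-style random MDP: each state-action pair transitions to
    # `branching` distinct next states with Dirichlet-random probabilities.
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[a, s, nxt] = rng.dirichlet(np.ones(branching))
    R = rng.random((n_states, n_actions))  # rewards in [0, 1)
    return P, R

def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iter=10_000):
    # Classical VI baseline: v <- max_a [ R(., a) + gamma * P_a v ].
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        q = R + gamma * np.einsum('asj,j->sa', P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

rng = np.random.default_rng(1)
P, R = garnet_mdp(n_states=20, n_actions=4, branching=3, rng=rng)
v = value_iteration(P, R)
```

With rewards bounded by 1, the fixed point satisfies 0 ≤ v ≤ 1/(1 − γ), which gives a quick sanity check on the output.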
Dataset Splits | No | The paper describes the MDP instances used but does not specify how these instances or any datasets were split into training, validation, or test sets. It defines a stopping criterion based on error relative to the optimal value function, but this does not describe data partitioning.
Hardware Specification | Yes | All experiments are conducted in Python on a macOS laptop equipped with a 3.2 GHz 8-core Apple M1 processor.
Software Dependencies | No | The paper mentions Python and NumPy (citing its foundational paper) and refers to automatic differentiation techniques (citing Paszke et al., often associated with PyTorch) and the MDP toolbox framework, but it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | The stopping criterion for all the algorithms is based on achieving a difference between the current value function and the optimal value function within a threshold of ϵ(1 − γ), given by the inequality e_t = ‖v_t − v*‖ ≤ ϵ(1 − γ). Throughout our experiments, we set ϵ to 0.1 and initialize all algorithms with v_0 = 0. In order to effectively portray the results, we utilize logarithmic scaling for both the vertical axis representing the error and the horizontal axis representing the running time in seconds. Notably, for the purpose of preserving numerical stability, we gradually increase the softmax parameter β. We apply the sketching matrix S_t ∈ {0, 1}^{d×s} as described in (Gower et al. 2019), which contains precisely one non-zero entry per row and column.
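The two concrete ingredients quoted above, the {0, 1}-valued sub-sampling sketch S_t and the ϵ(1 − γ) stopping rule, are easy to make precise in NumPy. The sketch below is a minimal reading of the description (each column a distinct standard basis vector, so every column has exactly one nonzero and every row at most one); the norm in the stopping rule is taken as the sup-norm, which is an assumption, as the paper excerpt does not spell it out.

```python
import numpy as np

def subsample_sketch(d, s, rng):
    # S in {0, 1}^{d x s}: each column is a distinct standard basis
    # vector of R^d, i.e. a column sub-sampling sketch as in
    # (Gower et al. 2019). Exactly one nonzero per column; at most
    # one nonzero per row (exactly one when s = d).
    rows = rng.choice(d, size=s, replace=False)
    S = np.zeros((d, s))
    S[rows, np.arange(s)] = 1.0
    return S

def converged(v, v_star, gamma, eps=0.1):
    # Stopping rule from the paper, read with the sup-norm:
    # e_t = ||v_t - v*||_inf <= eps * (1 - gamma), with eps = 0.1.
    return np.max(np.abs(v - v_star)) <= eps * (1.0 - gamma)

rng = np.random.default_rng(0)
S = subsample_sketch(d=6, s=3, rng=rng)
```

Applying S^T to a d-dimensional residual then simply selects s of its coordinates, which is what makes each sketched Newton step cheap relative to solving the full linear system.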