Online Symbolic Gradient-Based Optimization for Factored Action MDPs

Authors: Hao Cui, Roni Khardon

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An experimental evaluation on benchmark problems shows that the algorithm is competitive with the state of the art across problem sizes and that it provides significant improvements for large factored action spaces. "The top two rows in Figure 2 show a comparison of SOGBOFA to the baselines in all 6 domains showing a significant advantage on the problems with large action spaces."
Researcher Affiliation | Academia | Hao Cui and Roni Khardon, Department of Computer Science, Tufts University, Medford, Massachusetts, USA (Hao.Cui@tufts.edu, roni@cs.tufts.edu)
Pseudocode | Yes | Section 3, "Gradient Based Optimization", presents a numbered list of steps for the SOGBOFA algorithm, formatted like pseudocode: "1 Qf ← BuildQf(S, timeAllowed); 2 As = {}; 3 while time remaining; 4 do A ← RandomRestart(); 5 while time remaining and not converged; 6 do D ← CalculateGradient(Qf); 7 A ← MakeUpdates(D); 8 A ← Projection(A); 9 As.add(SampleConcreteAct(A)); 10 action ← Best(As)". A runnable sketch of this loop appears after the table.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | The evaluation is done over 6 domains: elevators from the RDDL distribution, two from the International Probabilistic Planning Competition (IPPC) 2011, and three from IPPC 2014, where domains amenable to expansion of the action space were chosen.
Dataset Splits | No | The paper describes using problem instances from RDDL and IPPC competitions and running experiments on them (e.g., "20 runs in each problem instance where each instance is run for 40 time steps"). However, it does not specify explicit training/validation/test dataset splits in terms of percentages or sample counts for these problem instances, nor does it refer to standard splits with citations.
Hardware Specification | Yes | The reported results represent averages from 20 runs in each problem instance where each instance is run for 40 time steps. In these runs each algorithm is given 60 seconds per step (on a cluster node having an Intel Xeon X5675 @ 3GHz CPU and 24GB memory).
Software Dependencies | No | The paper mentions using the RDDL language ("we use the RDDL language [Sanner, 2010] to specify the domain") for domain specification, but does not provide specific version numbers for any software libraries, frameworks, or tools used in the implementation or experimentation.
Experiment Setup | Yes | The algorithm limits depth when the expected number of gradient updates is below a pre-fixed threshold ... (where k = 200 in our experiments). ... "Our implementation stops the gradient search if the max change in probability is less than δ = 0.1, that is, ||A_new − A_old||_∞ ≤ 0.1. We then perform a linear search using 10 evenly spaced values a0, a1, ..., a9 in that region to find the best, where quality is measured by the value of the Q function DAG." See the line-search sketch after the table.
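
For concreteness, here is a minimal, self-contained sketch of the SOGBOFA loop quoted in the Pseudocode row. It is not the authors' implementation: the symbolic Q-function DAG and its symbolic gradient are replaced by an arbitrary callable q over action marginals and a finite-difference gradient, and all names (numeric_gradient, project, sogbofa_action) are illustrative stand-ins.

```python
import random
import time

def numeric_gradient(q, a, eps=1e-4):
    # Finite-difference stand-in for the paper's symbolic DAG gradient.
    grad = []
    for i in range(len(a)):
        hi = a[:i] + [min(1.0, a[i] + eps)] + a[i + 1:]
        lo = a[:i] + [max(0.0, a[i] - eps)] + a[i + 1:]
        denom = hi[i] - lo[i]
        grad.append((q(hi) - q(lo)) / denom if denom > 0 else 0.0)
    return grad

def project(a):
    # Simplistic projection: clip each marginal back into [0, 1].
    return [min(1.0, max(0.0, x)) for x in a]

def sogbofa_action(q, n, time_allowed, delta=0.1, step=0.05):
    # Random restarts + projected gradient ascent over n action marginals;
    # concrete 0/1 actions are sampled from the optimized marginals and
    # the best one seen before the deadline is returned (anytime behavior).
    deadline = time.time() + time_allowed
    best_act, best_val = None, float("-inf")
    while best_act is None or time.time() < deadline:
        a = [random.random() for _ in range(n)]          # random restart
        converged = False
        while not converged and time.time() < deadline:
            g = numeric_gradient(q, a)
            a_new = project([x + step * gx for x, gx in zip(a, g)])
            # stop when the max change in any marginal drops below delta = 0.1
            converged = max(abs(x - y) for x, y in zip(a_new, a)) < delta
            a = a_new
        act = [1.0 if random.random() < p else 0.0 for p in a]  # sample concrete act
        val = q(act)
        if val > best_val:
            best_act, best_val = act, val
    return best_act
```

As a toy usage example, q = lambda a: -sum((x - 0.7) ** 2 for x in a) makes sogbofa_action(q, 3, 0.1) return 0/1 actions sampled from marginals driven toward 0.7.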
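The line search described in the Experiment Setup row can be sketched the same way: 10 evenly spaced step sizes along the gradient direction are scored by the Q value, and the best point is kept. Again a hedged sketch: the default max_step is an assumption standing in for the paper's "region" (the largest feasible step, which the paper derives from its projection constraints), and q is any callable over marginals.

```python
def line_search(q, a, grad, max_step=1.0, num_points=10):
    # Probe num_points evenly spaced step sizes along the gradient
    # direction; score each clipped candidate by its Q value and
    # return the best point found.
    best_a, best_val = a, q(a)
    for i in range(1, num_points + 1):
        alpha = max_step * i / num_points
        cand = [min(1.0, max(0.0, x + alpha * g)) for x, g in zip(a, grad)]
        val = q(cand)
        if val > best_val:
            best_a, best_val = cand, val
    return best_a
```

In the loop sketched above, this would replace the fixed-step update (a_new = project([x + step * gx ...])) with a_new = line_search(q, a, g).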