The Choice Function Framework for Online Policy Improvement
Authors: Murugeswari Issakkimuthu, Alan Fern, Prasad Tadepalli (pp. 10178-10185)
AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7 Illustrative Empirical Results: Our primary contribution is theoretical; however, here we illustrate the potential practical utility of our framework using the LDCF family. The experiments are intended to illustrate the viewpoint of a choice function being a hyperparameter to be tuned offline. We implemented a variant of Forward Search Sparse Sampling (FSSS) (Walsh, Goschin, and Littman 2010) for approximately computing the online policy Π_u^ψ for any LDCF ψ and leaf-evaluation function u using an MDP simulator. The key parameter, other than the choice function, is the sampling width C, which controls how many state transitions are sampled for each action node. The supplementary material contains a summary of the algorithm. Our implementation is generally applicable and will be released upon publication with information to reproduce our experiments. Experiment setup: A full image of our experimental environment will be available upon publication. We run experiments in the Game-of-Life domain, a benchmark from the International Probabilistic Planning Competition. This is a grid-based domain in which each grid cell is alive with some probability depending on its neighbors. Actions allow for selecting one cell at each time step (or none) to set to be alive in the next time step. The reward is based on the number of alive cells at each step. There are 10 problems of grid sizes 3×3, 4×4, 5×5, and 10×3. Problems have different levels of stochasticity in the dynamics. We used supervised learning via imitation of a planner to train two base policies represented as neural networks, using the same approach as in (Issakkimuthu, Fern, and Tadepalli 2018). Each network outputs a probability distribution over actions. Policy 1 is a linear network, while Policy 2 is a non-linear network with 3 hidden layers. For each base policy, we consider four leaf-evaluation functions. The first is the constant zero function. The remaining three are neural networks with different configurations trained on a dataset of 5000 state-value pairs obtained via Monte-Carlo simulation of each policy. All the networks have been trained using TensorFlow to minimize the mean squared error. We experiment with 11 choice functions in the LDCF family with parameters H in {3, 4, 5} and (D, K) in {(0, 1), (1, 1), (1, 2), (2, 1)}. The combination (2, 1) is not applicable for H = 3. The number of discrepancies considered at the root node is 9 and at other internal nodes is 1. The discrepancies are determined from the action probabilities given by the base policy. We use C = 3 for FSSS. Given one of these policies and a problem, we would like to identify the best combination of LDCF parameters and leaf-evaluation function given a constraint on the time-per-step. In practice, this could be done in an offline tuning phase where different configurations are evaluated. Figure 2 shows a scatter plot of the normalized reward versus time-per-step for each of the 44 configurations (11 LDCF settings and 4 leaf-evaluation functions). The normalized reward is the average reward per episode divided by the average reward per episode of the base policy, so values greater than one indicate better performance than the base policy. For both base policies, all LDCF configurations perform better than the base policy. The improvement is larger for base policy 1, which makes sense because Policy 2 is a much higher-quality policy and hence more difficult to improve over. We also see that the LDCF space shows considerable coverage of the performance-versus-time space, which shows the utility of offline tuning. There is a general trend toward better performance for larger times, but this is not uniformly true. There are complex interactions between the LDCF parameters and a particular problem, which makes it unlikely that a single feature such as time-per-decision is always the best indicator. Table 1 gives results for each of the 10 problems, including normalized rewards with confidence intervals for the best-performing LDCF configuration for each of the policies. For the linear policy, the best LDCF configuration is never significantly worse and is often significantly better (lower confidence bound greater than 1). For the second, non-linear policy, the LDCF performance on most problems is not significantly worse than the base policy (confidence interval contains 1) and is sometimes significantly better. For three problems, the upper confidence bound is less than one, indicating a significant decrease in performance. These problems happen to be among the most stochastic problems in the benchmark set, which suggests that a likely reason for the decrease is the relatively small sampling width used for FSSS (C = 3), which provides a poor approximation for such problems. [Hedged sketches of the configuration grid and the normalized-reward comparison appear after the table.] |
| Researcher Affiliation | Academia | Murugeswari Issakkimuthu, Alan Fern, Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University Corvallis, OR 97331, USA |
| Pseudocode | No | The paper states 'The supplementary material contains a summary of the algorithm' but does not include structured pseudocode or algorithm blocks in the main body of the paper. |
| Open Source Code | No | Our implementation is generally applicable and will be released upon publication with information to reproduce our experiments. |
| Open Datasets | Yes | We run experiments in the domain Game-of-Life, a benchmark domain from the International Probabilistic Planning Competition. |
| Dataset Splits | No | The paper mentions training neural networks on 'a dataset of 5000 state-value pairs' but does not specify how this dataset was split into training, validation, or test sets for reproducibility. |
| Hardware Specification | No | The paper mentions 'We thank Intel for compute support' but does not provide specific details on the hardware used for running experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper states 'All the networks have been trained using TensorFlow' but does not provide a specific version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | We experiment with 11 choice functions in the LDCF family with parameters H in {3, 4, 5} and (D, K) in {(0, 1), (1, 1), (1, 2), (2, 1)}. The combination (2, 1) is not applicable for H = 3. The number of discrepancies considered at the root node is 9 and at other internal nodes is 1. The discrepancies are determined from the action probabilities given by the base policy. We use C = 3 for FSSS. ...All the networks have been trained using TensorFlow to minimize the mean squared error. [See the configuration-grid sketch below the table.] |
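
For concreteness, the hyperparameter space described in the Experiment Setup row can be enumerated directly. The following is a minimal sketch, not code from the paper: the list names and the string labels for the leaf evaluators are hypothetical, and only the counting is taken from the text (11 LDCF settings times 4 leaf-evaluation functions gives the 44 configurations in Figure 2, with (D, K) = (2, 1) excluded for H = 3).

```python
from itertools import product

# Hypothetical enumeration of the tuned configuration space described in the paper:
# horizons H in {3, 4, 5}, discrepancy settings (D, K) in {(0,1), (1,1), (1,2), (2,1)},
# with (2, 1) inapplicable for H = 3, and four leaf-evaluation functions
# (constant zero plus three learned value networks). Labels are placeholders.
HORIZONS = [3, 4, 5]
DK_PAIRS = [(0, 1), (1, 1), (1, 2), (2, 1)]
LEAF_EVALUATORS = ["zero", "value_net_1", "value_net_2", "value_net_3"]

ldcf_settings = [
    (h, d, k)
    for h, (d, k) in product(HORIZONS, DK_PAIRS)
    if not (h == 3 and (d, k) == (2, 1))  # (2, 1) is not applicable for H = 3
]
assert len(ldcf_settings) == 11  # 3 * 4 - 1 = 11 LDCF settings

configurations = list(product(ldcf_settings, LEAF_EVALUATORS))
assert len(configurations) == 44  # matches the 44 points in the paper's scatter plot

# Offline-tuning viewpoint: under a time-per-step budget, evaluate each configuration
# with the FSSS-based online policy (sampling width C = 3 in the paper) and keep the best.
```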
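The results discussion compares each configuration to its base policy through the normalized reward (average reward per episode divided by the base policy's average reward per episode) and reads Table 1 through confidence intervals around that ratio. The sketch below illustrates that bookkeeping under an assumed normal-approximation interval on the improved policy's mean return; the paper does not state how its intervals were computed, so the interval construction and the function names here are placeholders.

```python
import math
from statistics import mean, stdev

def normalized_reward_interval(policy_returns, base_returns, z=1.96):
    """Normalized reward (mean return / base-policy mean return) with an approximate
    95% interval, normalizing only the policy's interval by the base mean.
    This is a simplifying assumption, not the paper's exact procedure."""
    base_mean = mean(base_returns)
    m = mean(policy_returns)
    half_width = z * stdev(policy_returns) / math.sqrt(len(policy_returns))
    return (m - half_width) / base_mean, m / base_mean, (m + half_width) / base_mean

def classify(lower, upper):
    # Mirrors how the paper reads Table 1: lower bound > 1 means significantly better,
    # upper bound < 1 means significantly worse, otherwise not significantly different.
    if lower > 1.0:
        return "significantly better than the base policy"
    if upper < 1.0:
        return "significantly worse than the base policy"
    return "not significantly different from the base policy"

# Example with made-up episode returns:
lo, mid, hi = normalized_reward_interval([10.2, 11.0, 9.8, 10.6], [9.5, 9.9, 10.1, 9.7])
print(round(lo, 3), round(mid, 3), round(hi, 3), "->", classify(lo, hi))
```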