The Choice Function Framework for Online Policy Improvement
Authors: Murugeswari Issakkimuthu, Alan Fern, Prasad Tadepalli (pp. 10178-10185)
AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7 Illustrative Empirical Results: Our primary contribution is theoretical; however, here we illustrate the potential practical utility of our framework using the LDCF family. The experiments are intended to illustrate the viewpoint of a choice function being a hyperparameter to be tuned offline. We implemented a variant of Forward Search Sparse Sampling (FSSS) (Walsh, Goschin, and Littman 2010) for approximately computing the online policy Π_u^ψ for any LDCF ψ and leaf-evaluation function u using an MDP simulator. The key parameter, other than the choice function, is the sampling width C, which controls how many state transitions are sampled for each action node. The supplementary material contains a summary of the algorithm. Our implementation is generally applicable and will be released upon publication with information to reproduce our experiments. Experiment setup: A full image of our experimental environment will be available upon publication. We run experiments in the Game-of-Life domain, a benchmark from the International Probabilistic Planning Competition. This is a grid-based domain in which each grid cell is alive with some probability depending on its neighbors. Actions allow for selecting one cell at each time step (or none) to set to be alive in the next time step. The reward is based on the number of alive cells at each step. There are 10 problems of grid sizes 3×3, 4×4, 5×5, and 10×3. Problems have different levels of stochasticity in the dynamics. We used supervised learning via imitation of a planner to train two base policies represented as neural networks, using the same approach as in (Issakkimuthu, Fern, and Tadepalli 2018). Each network outputs a probability distribution over actions. Policy 1 is a linear network, while Policy 2 is a non-linear network with 3 hidden layers. For each base policy, we consider four leaf-evaluation functions. The first is the constant zero function. The remaining three are neural networks with different configurations trained on a dataset of 5000 state-value pairs obtained via Monte-Carlo simulation of each policy. All the networks have been trained using TensorFlow to minimize the mean squared error. We experiment with 11 choice functions in the LDCF family with parameters H in {3, 4, 5} and (D, K) in {(0, 1), (1, 1), (1, 2), (2, 1)}. The combination (2, 1) is not applicable for H = 3. The number of discrepancies considered at the root node is 9 and at other internal nodes is 1. The discrepancies are determined from the action probabilities given by the base policy. We use C = 3 for FSSS. Given one of these policies and a problem, we would like to identify the best combination of LDCF parameters and leaf-evaluation function given a constraint on the time-per-step. In practice, this could be done in an offline tuning phase where different configurations are evaluated. Figure 2 shows a scatter plot of the normalized reward versus time-per-step for each of the 44 configurations (11 LDCF settings and 4 leaf-evaluation functions). The normalized reward is the average reward per episode divided by the average reward per episode of the base policy, so values greater than one indicate better performance than the base policy. For both base policies, all LDCF configurations perform better than the base policy. The improvement is larger for base policy 1, which makes sense because Policy 2 is a much higher-quality policy and hence more difficult to improve over. We also see that the LDCF space shows considerable coverage of the performance-versus-time space, which shows the utility of offline tuning. There is a general trend toward better performance for larger times, but this is not uniformly true. There are complex interactions between the LDCF parameters and a particular problem, which makes it unlikely that a single feature such as time-per-decision is always the best indicator. Table 1 gives results for each of the 10 problems, including normalized rewards with confidence intervals for the best-performing LDCF configuration for each of the policies. For the linear policy, the best LDCF configuration is never significantly worse and is often significantly better (lower confidence bound greater than 1). For the second, non-linear policy, the LDCF performance on most problems is not significantly worse than the base policy (confidence interval contains 1) and is sometimes significantly better. For three problems, the upper confidence bound is less than one, indicating a significant decrease in performance. These problems happen to be among the most stochastic problems in the benchmark set, which suggests that a likely reason for the decrease is the relatively small sampling width used for FSSS (C = 3), which provides a poor approximation for such problems. [Hedged sketches of the configuration grid and the normalized-reward comparison appear after the table.] |
| Researcher Affiliation | Academia | Murugeswari Issakkimuthu, Alan Fern, Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University Corvallis, OR 97331, USA |
| Pseudocode | No | The paper states 'The supplementary material contains a summary of the algorithm' but does not include structured pseudocode or algorithm blocks in the main body of the paper. |
| Open Source Code | No | Our implementation is generally applicable and will be released upon publication with information to reproduce our experiments. |
| Open Datasets | Yes | We run experiments in the domain Game-of-Life, a benchmark domain from the International Probabilistic Planning Competition. |
| Dataset Splits | No | The paper mentions training neural networks on 'a dataset of 5000 state-value pairs' but does not specify how this dataset was split into training, validation, or test sets for reproducibility. |
| Hardware Specification | No | The paper mentions 'We thank Intel for compute support' but does not provide specific details on the hardware used for running experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper states 'All the networks have been trained using TensorFlow' but does not provide a specific version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | We experiment with 11 choice functions in the LDCF family with parameters H in {3, 4, 5} and (D, K) in {(0, 1), (1, 1), (1, 2), (2, 1)}. The combination (2, 1) is not applicable for H = 3. The number of discrepancies considered at the root node is 9 and at other internal nodes is 1. The discrepancies are determined from the action probabilities given by the base policy. We use C = 3 for FSSS. ...All the networks have been trained using TensorFlow to minimize the mean squared error. [See the configuration-grid sketch below the table.] |
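
For concreteness, the hyperparameter space described in the Experiment Setup row can be enumerated directly. The following is a minimal sketch, not code from the paper: the list names and the string labels for the leaf evaluators are hypothetical, and only the counting is taken from the text (11 LDCF settings times 4 leaf-evaluation functions gives the 44 configurations in Figure 2, with (D, K) = (2, 1) excluded for H = 3).

```python
from itertools import product

# Hypothetical enumeration of the tuned configuration space described in the paper:
# horizons H in {3, 4, 5}, discrepancy settings (D, K) in {(0,1), (1,1), (1,2), (2,1)},
# with (2, 1) inapplicable for H = 3, and four leaf-evaluation functions
# (constant zero plus three learned value networks). Labels are placeholders.
HORIZONS = [3, 4, 5]
DK_PAIRS = [(0, 1), (1, 1), (1, 2), (2, 1)]
LEAF_EVALUATORS = ["zero", "value_net_1", "value_net_2", "value_net_3"]

ldcf_settings = [
    (h, d, k)
    for h, (d, k) in product(HORIZONS, DK_PAIRS)
    if not (h == 3 and (d, k) == (2, 1))  # (2, 1) is not applicable for H = 3
]
assert len(ldcf_settings) == 11  # 3 * 4 - 1 = 11 LDCF settings

configurations = list(product(ldcf_settings, LEAF_EVALUATORS))
assert len(configurations) == 44  # matches the 44 points in the paper's scatter plot

# Offline-tuning viewpoint: under a time-per-step budget, evaluate each configuration
# with the FSSS-based online policy (sampling width C = 3 in the paper) and keep the best.
```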
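The results discussion compares each configuration to its base policy through the normalized reward (average reward per episode divided by the base policy's average reward per episode) and reads Table 1 through confidence intervals around that ratio. The sketch below illustrates that bookkeeping under an assumed normal-approximation interval on the improved policy's mean return; the paper does not state how its intervals were computed, so the interval construction and the function names here are placeholders.

```python
import math
from statistics import mean, stdev

def normalized_reward_interval(policy_returns, base_returns, z=1.96):
    """Normalized reward (mean return / base-policy mean return) with an approximate
    95% interval, normalizing only the policy's interval by the base mean.
    This is a simplifying assumption, not the paper's exact procedure."""
    base_mean = mean(base_returns)
    m = mean(policy_returns)
    half_width = z * stdev(policy_returns) / math.sqrt(len(policy_returns))
    return (m - half_width) / base_mean, m / base_mean, (m + half_width) / base_mean

def classify(lower, upper):
    # Mirrors how the paper reads Table 1: lower bound > 1 means significantly better,
    # upper bound < 1 means significantly worse, otherwise not significantly different.
    if lower > 1.0:
        return "significantly better than the base policy"
    if upper < 1.0:
        return "significantly worse than the base policy"
    return "not significantly different from the base policy"

# Example with made-up episode returns:
lo, mid, hi = normalized_reward_interval([10.2, 11.0, 9.8, 10.6], [9.5, 9.9, 10.1, 9.7])
print(round(lo, 3), round(mid, 3), round(hi, 3), "->", classify(lo, hi))
```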