Commitment Semantics for Sequential Decision Making under Reward Uncertainty
Authors: Qi Zhang, Edmund Durfee, Satinder Singh, Anna Chen, Stefan Witwicki
IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now present a preliminary empirical evaluation comparing the runtime and solution quality of CCIMR to the MR and EBS algorithms on the following two sample problems. All algorithms were implemented and run on a 64-bit Windows machine with a 1.8 GHz CPU and 4 GB RAM. Figure 2: Results on the Gate Control problem. Leftmost: the expected value of the robot under various commitment probabilities with commitment time 4 and T = 5. Middle Left: with commitment time 5 and T = 5. Middle Right: the expected joint value of the robot and the user under various commitment probabilities. Rightmost: computation time as the time horizon scales up. Table 1: Results on Committed Rock Sample. |
| Researcher Affiliation | Collaboration | Qi Zhang, Edmund Durfee, Satinder Singh (University of Michigan, {qizhg,durfee,baveja}@umich.edu); Anna Chen (Quora, Inc., anna1110@gmail.com); Stefan Witwicki (Nissan Research Center, stefan.witwicki@nissan-usa.com) |
| Pseudocode | Yes | CCIMR Algorithm Description. Our Commitment Constrained Iterative Mean Reward (CCIMR) algorithm, described below, updates the robot's policy according to its reward observations while still achieving the commitment (a runnable sketch of this loop appears after the table). 1. Initialize/update reward belief. Use prior knowledge/the standard POMDP Bayes rule to establish/update the probability distribution over reward functions as $\mu_t$. 2. Update mean reward. If the belief changed, compute the mean reward as $R_{\mu_t}(s, a) = \sum_k \mu_t(k) R_k(s, a)$ (6). 3. Update optimal constrained policy. If the mean reward changed, update the current policy to $\pi_t$, ensuring $\pi_t \in \Pi\vert_{\pi_{t-1}}$: $\pi_t = \arg\max_{\pi \in \Pi\vert_{\pi_{t-1}}} V^{\pi}_{R_{\mu_t}}(s_t)$ (7). 4. Take the stochastic action prescribed by the current policy, and loop until the time horizon is reached. |
| Open Source Code | No | The paper does not contain any statement about making code open source or providing a link to a code repository. |
| Open Datasets | No | The paper refers to the “Gate Control” problem and a “variant of the Rock Sample problem [Smith and Simmons, 2004]”. These are simulated problem domains rather than publicly available datasets; the paper provides no concrete access information or standard dataset citation for them. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, sample counts for training, validation, or test sets). |
| Hardware Specification | Yes | All algorithms were implemented and run on a 64-bit Windows machine with 1.8 GHz CPU and 4 GB RAM. |
| Software Dependencies | No | The paper does not specify any software dependencies (e.g., programming languages, libraries, or solvers) with specific version numbers. |
| Experiment Setup | Yes | The left two plots compare the robot's expected value under CCIMR to that of MR and EBS under different commitment probabilities when the time horizon T is 5 and the commitment time is 4 and 5, respectively. Given any arbitrary commitment, the expected value computed by CCIMR is indeed between those of EBS and MR, with substantial improvement over MR. Under a deterministic policy, even though the commitment probabilities of opening the gate are continuous, the achievable probabilities can only be discrete; hence, the expected value over different commitment probabilities under a deterministic policy is always stepwise. Stochastic policies can do strictly better because they achieve finer tradeoffs between the commitment constraints and rewards, as shown in the middle left plot of Figure 2. To find the best commitment that optimizes the expected joint value of the robot and the user, the set of feasible commitment probabilities is discretized with granularity 0.02. We constrained the rover with T - n commitments such that the probability of leaving the region grows linearly as the time approaches the horizon: $\rho_\tau = \max\left(0, \frac{\tau - n + 1}{T - n + 1}\right)$ for all $\tau \le T$ ($\rho_\tau = 0$ when $\tau < n$ because it takes the rover at least n time steps to leave the region); a small script evaluating this schedule follows the table. |
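
The pseudocode row above describes the CCIMR loop only in prose, and the paper does not release code. The following Python sketch is an illustration of that loop under stated assumptions: the helper interfaces (`solve_constrained`, `env`), the deterministic-reward likelihood used in the Bayes update, and the `R_k[s, a]` array layout are hypothetical choices for this sketch, not the authors' implementation.

```python
import numpy as np

def ccimr_episode(prior, reward_fns, solve_constrained, env, horizon, rng=None):
    """Sketch of the CCIMR loop: act under the mean reward of the current belief
    while re-solving only within the commitment-respecting policy set."""
    rng = rng or np.random.default_rng()
    belief = np.asarray(prior, dtype=float)   # mu_t over the K candidate reward functions
    state = env.reset()
    policy = None                             # pi_{t-1}; None means any commitment-feasible policy
    for t in range(horizon):
        # Step 2: mean reward under the current belief, R_mu(s, a) = sum_k mu(k) R_k(s, a)
        mean_reward = sum(b * R for b, R in zip(belief, reward_fns))
        # Step 3: re-optimize, constrained to policies consistent with the commitment
        # and with the previously chosen policy (the set Pi | pi_{t-1} in the paper)
        policy = solve_constrained(mean_reward, state, prev_policy=policy)
        # Step 4: take the stochastic action prescribed by the current (stochastic) policy
        action_probs = policy[state]
        action = rng.choice(len(action_probs), p=action_probs)
        next_state, observed_reward = env.step(action)
        # Step 1 (next iteration): Bayes update of the belief from the reward observation.
        # Assuming deterministic rewards, only reward functions consistent with the
        # observation keep positive probability.
        likelihood = np.array([float(np.isclose(R[state, action], observed_reward))
                               for R in reward_fns])
        if likelihood.sum() > 0:
            belief = belief * likelihood
            belief /= belief.sum()
        state = next_state
    return belief
```

`solve_constrained` stands in for the paper's commitment-constrained linear-program/MDP solve; any policy it returns is assumed to already satisfy the commitment probabilities.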
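
For the Committed Rock Sample setup, the snippet below is a minimal sketch assuming the linearly growing schedule $\rho_\tau = \max(0, (\tau - n + 1)/(T - n + 1))$ stated above; the function name, the symbol $\rho_\tau$, and the example values n = 3, T = 5 are illustrative only.

```python
def commitment_probability(tau, n, T):
    """Probability that the rover has left the region by time tau, under the
    assumed schedule: 0 for tau < n, growing linearly to 1 at the horizon T."""
    return max(0.0, (tau - n + 1) / (T - n + 1))

# Example: with n = 3 and T = 5 the schedule over tau = 0..5 is
# [0.0, 0.0, 0.0, 0.333, 0.667, 1.0]
print([round(commitment_probability(tau, n=3, T=5), 3) for tau in range(6)])
```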