Goal-HSVI: Heuristic Search Value Iteration for Goal POMDPs

Authors: Karel Horák, Branislav Bošanský, Krishnendu Chatterjee

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an experimental evaluation of Goal-HSVI in comparison with RTDP-Bel [Bonet and Geffner, 2009]. We do not include vanilla HSVI (HSVI2 [Smith and Simmons, 2005]) in the experiments, as the algorithm without modifications is theoretically incorrect in the Goal-POMDP setting, and it indeed crashed due to the recursion limit on some instances (Hallway 2, Seq[5,5,3], Seq[5,5,4]). We start with a description of the settings of the algorithms considered.

Goal-HSVI. Our implementation is based on the ZMDP implementation of HSVI2. We updated the solver according to Section 5. A few other changes were made: (1) We do not use the adaptive precision from ZMDP that changes ϵ between iterations (fixed values ϵ = 2 and η = 0.8 are used). (2) We do not use α-vector masking, as the implementation of this technique in ZMDP is incompatible with Goal-POMDPs. We terminate the algorithm after 900s if an ϵ-optimal solution is not found.

RTDP-Bel. The GPT solver is used as a reference implementation of RTDP-Bel. Since there are no guidelines for choosing the parameters of RTDP-Bel, we use the default values used in GPT (most importantly, K = 15 as in [Bonet and Geffner, 2009]), except for increasing the cutoff parameter from 250 to 2000. In our experiments we let RTDP-Bel perform 150,000 trials before terminating. As RTDP-Bel is a randomized algorithm, we perform 12 independent runs and report the result with the lowest average cost. We consider the cost of RTDP-Bel policies as a reference, and we report, as t_RTDP, the time when Goal-HSVI finds a policy of the same quality.

Policy evaluation. We evaluate the quality of the policies computed by the algorithms using simulation. We perform 250,000 simulated plays (cutting each of them after 2,000 steps if the goal is not reached by that time) and report the average total cost. We also report the percentage of simulated plays that did not terminate within the limit.

We evaluate the performance of our Goal-HSVI algorithm on three domains: Hallway [Littman et al., 1995] and Rock Sample [Smith and Simmons, 2004] in their Goal-POMDP variants, and a new domain of Sequencing (inspired by [Kress-Gazit et al., 2009]).

Hallway [Littman et al., 1995]. An agent is navigating a maze, trying to reach the goal location while using unreliable actuators and sensors. In the original version, the agent receives a reward only when the goal is reached, and the discounted-sum objective is considered. For the Goal-POMDP version of the problem, we assume that the goal state is absorbing and each step of the agent costs one unit. [Bonet and Geffner, 2009] observed that RTDP-Bel was outperformed by HSVI2 in the discounted-sum setting. Our Goal-HSVI algorithm similarly outperforms RTDP-Bel in the Goal-POMDP variant of Hallway (see Table 1). Moreover, the upper bound on cost produced by Goal-HSVI after 2s is lower than the cost of RTDP-Bel, and unlike RTDP-Bel, the policy produced by our algorithm always reached the goal in fewer than 2,000 steps. (RTDP-Bel failed to reach the goal in 2.5% and 16.5% of plays on Hallway and Hallway 2, respectively.)

Rock Sample[n,k] [Smith and Simmons, 2004]. A robot is operating in an n × n grid with k rocks. Each of the rocks can be either good or bad (unknown to the agent). The goal is to sample all the good rocks (approach them and perform expensive sampling) and then leave the map. In the Goal-POMDP version [Chatterjee et al., 2016], a cost is associated with each movement. Moreover, the agent pays a penalty for all the good rocks he failed to sample upon leaving the map. RTDP-Bel works well on discounted Rock Sample [Bonet and Geffner, 2009] due to the problem structure (e.g., knowledge of the current position), and the same is expected in the Goal-POMDP setting. Although Goal-HSVI does not leverage the problem structure, it is competitive on all Rock Sample instances we consider (see Table 1). Moreover, it provably found solutions of comparable (or even better) quality to RTDP-Bel by decreasing the upper bound on cost (see t_RTDP for the time required). Recall that RTDP-Bel cannot provide any such guarantees on the quality of the computed policy.

Sequencing[n,k,t]. An agent inspects t targets in an n × n grid with k obstacles (see Figure 5). He is uncertain about his position, and he has 5 actions available: 4 movement actions (N, S, W, E) and the inspection action. The movement actions are not reliable and may result in a step in an unintended direction with probability 0.15. The inspection action is deterministic and inspects the target (if there is one at the current position). The agent may receive two observations: either the last action succeeded (he stepped on an empty square / inspected a target) or it failed (he hit an obstacle / there is no target to inspect). The agent has to inspect the targets in a prescribed order; otherwise, each inspection incurs a penalty of 100 for every target that should have been inspected earlier. For example, if he inspects the targets in Figure 5 in the order (4, 1, 3, 2), he accumulates a penalty of 400. We observe that RTDP-Bel does not work well for Sequencing and fails to find a policy reaching the goal state reliably, especially on the two larger instances. In contrast, our Goal-HSVI algorithm produces superior policies that always reached the goal (see Table 1). Notice that on Sequencing[5,5,3] and Sequencing[5,5,4], the time to complete 150,000 trials of RTDP-Bel is comparable to the time given to Goal-HSVI, yet the policy of RTDP-Bel is still inferior.
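For concreteness, the following is a minimal sketch of the out-of-order penalty rule described above, reproducing the 400-unit example for the inspection order (4, 1, 3, 2). The function name, signature, and unit_penalty default are illustrative assumptions, not code from the paper.

    def sequencing_penalty(order, unit_penalty=100):
        """Penalty for inspecting targets out of their prescribed order.

        `order` lists target indices (1..t) in the order they are inspected.
        Each inspection costs `unit_penalty` for every lower-indexed target
        that has not been inspected yet, i.e. the targets the agent should
        have inspected earlier.
        """
        inspected = set()
        penalty = 0
        for target in order:
            skipped = sum(1 for earlier in range(1, target) if earlier not in inspected)
            penalty += unit_penalty * skipped
            inspected.add(target)
        return penalty

    # Worked example from the text: the order (4, 1, 3, 2) incurs
    # 300 (targets 1-3 skipped before 4) + 0 + 100 (target 2 skipped before 3) + 0 = 400.
    assert sequencing_penalty([4, 1, 3, 2]) == 400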
Researcher Affiliation Academia Karel Hor ak1, Branislav Boˇsansk y1, Krishnendu Chatterjee2 1 Department of Computer Science, FEE, Czech Technical University in Prague 2 IST Austria, Klosterneuburg, Austria
Pseudocode | Yes | Algorithm 1: A single trial of the RTDP-Bel algorithm. Algorithm 2: Point-based update(b) procedure of the upper and lower bounds (V̄, V̲). Algorithm 3: HSVI2 for discounted POMDPs; the pseudocode follows the ZMDP implementation and includes the update on line 6. Algorithm 4: Goal-HSVI, η ∈ [0, 1).
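To give a feel for the recursive trial structure these listings share (and which caused the recursion-limit crashes of unmodified HSVI2 noted above), here is a minimal sketch of a generic HSVI-style exploration trial. All helper names (bounds.gap, bounds.optimistic_action, model.next_belief, tolerance, ...) are assumed interfaces used only for illustration; this is not the authors' ZMDP-based implementation or the exact Goal-HSVI procedure.

    def hsvi_trial(belief, depth, bounds, model, tolerance):
        """One HSVI-style exploration trial (illustrative sketch only)."""
        # Stop once the gap between the upper and lower bound at this belief is
        # within the depth-dependent tolerance (which would encode epsilon and eta).
        if bounds.gap(belief) <= tolerance(depth):
            return
        # Pick the action that looks best under the optimistic bound.
        action = bounds.optimistic_action(belief)
        # Pick the observation whose successor belief contributes most to the gap
        # at the current belief (weighted-excess heuristic).
        obs = max(model.observations,
                  key=lambda o: model.obs_prob(belief, action, o)
                                * bounds.gap(model.next_belief(belief, action, o)))
        # Recurse on the chosen successor belief ...
        hsvi_trial(model.next_belief(belief, action, obs), depth + 1,
                   bounds, model, tolerance)
        # ... and perform a point-based update of both bounds on the way back up.
        bounds.update(belief)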
Open Source Code | No | The paper states, 'Our implementation is based on the ZMDP implementation of HSVI2.' and 'The GPT solver is used as a reference implementation of RTDP-Bel.' It provides links to these external tools. However, it does not explicitly state that the authors are releasing the source code for their specific Goal-HSVI implementation.
Open Datasets | Yes | We evaluate the performance of our Goal-HSVI algorithm on three different domains. The domains are: Hallway [Littman et al., 1995] and Rock Sample [Smith and Simmons, 2004] in their Goal-POMDP variants, and a new domain of Sequencing (inspired by [Kress-Gazit et al., 2009]).
Dataset Splits | No | The paper evaluates policies via simulation on POMDP environments rather than using traditional train/validation/test dataset splits. It describes the simulation protocol ('We perform 250,000 simulated plays...'), but there is no explicit mention of partitioning a dataset into training, validation, or test sets for model training or evaluation in the conventional sense.
Hardware Specification | Yes | Table 1: Experimental results (on Intel Core i7-8700K).
Software Dependencies | No | The paper mentions the 'ZMDP implementation of HSVI2' and the 'GPT solver' but does not specify their version numbers or any other software dependencies with version details.
Experiment Setup | Yes | Our implementation is based on the ZMDP implementation of HSVI2. We updated the solver according to Section 5. A few other changes were made: (1) We do not use the adaptive precision from ZMDP that changes ϵ between iterations (fixed values ϵ = 2 and η = 0.8 are used). (2) We do not use α-vector masking, as the implementation of this technique in ZMDP is incompatible with Goal-POMDPs. We terminate the algorithm after 900s if an ϵ-optimal solution is not found. In our experiments we let RTDP-Bel perform 150,000 trials before terminating. ... We perform 250,000 simulated plays (we cut each of them after 2,000 steps if the goal is not reached by that time).
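As an illustration of the evaluation protocol quoted above, the sketch below runs the stated number of simulated plays with the 2,000-step cutoff and reports the average total cost together with the percentage of plays that never reached the goal. The policy and env interfaces are hypothetical assumptions for the sketch, not the authors' evaluation code.

    def evaluate_policy(policy, env, num_plays=250_000, step_cap=2_000):
        """Monte-Carlo evaluation: average total cost and % of plays cut off."""
        total_cost = 0.0
        unfinished = 0
        for _ in range(num_plays):
            observation = env.reset()          # start a fresh simulated play
            belief = policy.initial_belief()
            cost, reached_goal = 0.0, False
            for _ in range(step_cap):
                action = policy.act(belief)
                observation, step_cost, reached_goal = env.step(action)
                belief = policy.update_belief(belief, action, observation)
                cost += step_cost
                if reached_goal:
                    break
            total_cost += cost
            if not reached_goal:               # the 2,000-step cutoff was hit
                unfinished += 1
        return total_cost / num_plays, 100.0 * unfinished / num_plays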