Belief-Dependent Macro-Action Discovery in POMDPs using the Value of Information

Authors: Genevieve Flaspohler, Nicholas A. Roy, John W. Fisher III

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In simulated tracking experiments, we achieve higher reward than both closed-loop and hand-coded macro-action baselines, selectively using Vo I macro-actions to reduce planning complexity while maintaining near-optimal task performance.
Researcher Affiliation Academia Genevieve Flaspohler (1,2), Nicholas Roy (1), and John W. Fisher III (1); (1) Massachusetts Institute of Technology and (2) Woods Hole Oceanographic Institution. {geflaspo, nickroy, fisher}@csail.mit.edu
Pseudocode Yes We modify the standard value iteration backup operation to compute the VoI, adding open-loop backups whenever the VoI is low. An algorithm summary is presented in the supplement. ... An algorithm summary for macro-action chaining is presented in the supplement. (A hedged sketch of this VoI-gated backup appears after the table.)
Open Source Code No The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets No The paper describes a simulated tracking problem in a '10x10 discretized map' and mentions 'simulated tracking experiments', but it does not refer to a named, publicly available dataset with concrete access information (e.g., link, DOI, formal citation).
Dataset Splits No The paper does not explicitly provide details about training, validation, and test dataset splits with percentages or sample counts, nor does it reference predefined splits.
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies No The paper states 'value function approximation is performed using a custom implementation of PBVI [15]', but does not provide specific version numbers for software libraries or dependencies. PBVI is an algorithm, not a specific software with a version.
Experiment Setup Yes We assume discrete states, actions and observations and represent the value function by a piecewise-linear and convex (PWLC) collection of α-vectors [9]. An adaptation of the algorithm presented in Section 4 to a PWLC value function is contained in the supplement (Section C). We demonstrate macro-action generation in a dynamic tracking problem (Fig. 4), in which a fully observable, actuated agent tracks a partially observable target moving in a known 10x10 discretized map (|S| = 10,000). A full description of the experimental domain and parameterization is in the supplement (Section D). ... The VoI macro-action policy (VoI MA) is compared against an approximation to the closed-loop optimal policy (base closed-loop, Base CL) and a fixed length macro-action (Fixed MA) policy, which is constrained to act closed-loop only every T = 15 planning iterations. ... We additionally explore the effect of the VoI threshold τ on planner performance, macro-action utilization, and the value of δB.
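
The Pseudocode row describes modifying the standard value-iteration backup to compute the value of information (VoI) and to fall back to an open-loop backup whenever the VoI is low. The paper's actual algorithm is in its supplement; the following is only a minimal sketch of that gating logic, assuming a PWLC value function represented by α-vectors and taking the VoI at a belief as the gap between the closed-loop and open-loop backup values, which is one natural reading of the summary above.

```python
import numpy as np

def voi_gated_backup(b, alphas, T, Z, R, gamma, tau):
    """One point-based backup at belief b that keeps an open-loop
    alpha-vector whenever the value of information (VoI) falls below tau.

    Illustrative sketch, not the authors' implementation. Shapes assumed:
      b      : (S,)        belief over states
      alphas : (K, S)      current PWLC value function (alpha-vectors)
      T      : (A, S, S)   T[a, s, s'] = p(s' | s, a)
      Z      : (A, S, O)   Z[a, s', o] = p(o | s', a)
      R      : (A, S)      immediate reward
    """
    A, _, O = Z.shape
    best_cl_val, best_cl_vec = -np.inf, None
    best_ol_val, best_ol_vec = -np.inf, None

    for a in range(A):
        # g[k, s, o] = gamma * sum_{s'} T[a, s, s'] * Z[a, s', o] * alphas[k, s']
        g = gamma * np.einsum('ij,jo,kj->kio', T[a], Z[a], alphas)

        # Closed-loop: pick the best continuation alpha separately per observation.
        per_obs_vals = np.einsum('i,kio->ko', b, g)          # (K, O)
        best_k = per_obs_vals.argmax(axis=0)                 # (O,)
        cl_vec = R[a] + sum(g[best_k[o], :, o] for o in range(O))
        cl_val = float(b @ cl_vec)
        if cl_val > best_cl_val:
            best_cl_val, best_cl_vec = cl_val, cl_vec

        # Open-loop: commit to one continuation alpha regardless of observation
        # (summing g over o marginalizes out the observation).
        g_ol = g.sum(axis=2)                                 # (K, S)
        k_star = int((g_ol @ b).argmax())
        ol_vec = R[a] + g_ol[k_star]
        ol_val = float(b @ ol_vec)
        if ol_val > best_ol_val:
            best_ol_val, best_ol_vec = ol_val, ol_vec

    # Low VoI: observing barely improves the value at this belief, so the
    # cheaper open-loop (macro-action style) backup is kept instead.
    voi = best_cl_val - best_ol_val
    return (best_ol_vec, voi) if voi < tau else (best_cl_vec, voi)
```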
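
The Experiment Setup row quotes the main quantities of the tracking domain and the three planners compared. Purely for illustration, and not from the authors' code, those quantities could be gathered into a configuration like the one below; all names are hypothetical, and the τ sweep is left as a placeholder since the paper's threshold values are not quoted above.

```python
# Hypothetical experiment configuration assembled from the quantities quoted
# in the Experiment Setup row; names and placeholders are illustrative only.
experiment_config = {
    "map_shape": (10, 10),       # 10x10 discretized map
    "num_states": 10_000,        # |S| = 10,000
    "planners": {
        "Base CL":  {"use_macro_actions": False},             # approximation to closed-loop optimal
        "Fixed MA": {"use_macro_actions": True, "T": 15},      # closed-loop only every T = 15 iterations
        "VoI MA":   {"use_macro_actions": True, "tau": None},  # tau set per sweep value below
    },
    # Sweep over the VoI threshold tau to study planner performance,
    # macro-action utilization, and delta_B; fill with the paper's values.
    "tau_sweep": [],
}
```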