Oracle-Efficient Regret Minimization in Factored MDPs with Unknown Structure

Authors: Aviv Rosenberg, Yishay Mansour

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We test our algorithm on the Sys Admin domain [Guestrin et al., 2003] a network of servers connected by some topology, where failing servers affect the probability of their neighbors to fail and the admin chooses which server to reboot at each time step. Our experiments show that the performance of SLF-UCRL is comparable to that of Factored-UCRL [Osband and Van Roy, 2014] that knows the factored structure in advance, and significantly better than the performance of UCRL [Jaksch et al., 2010] that completely ignores the factorization. Figure 1 shows that, for circular topology with 4 servers (i.e., 4 state factors and scope size 3), SLF-UCRL eliminates the wrong scopes (right figure), and has similar regret to Factored-UCRL (left figure).
Researcher Affiliation Collaboration Aviv Rosenberg Tel-Aviv University avivros007@gmail.com Yishay Mansour Tel-Aviv University and Google Research, Tel Aviv mansour@tau.ac.il
Pseudocode Yes Algorithm 1 SLF-UCRL Sketch Input: δ, m, S = {Si}d i=1, S A = X = {Xi}n i=1. Initialize visit counters and sets of consistent scopes. for k = 1, 2, . . . do Start new episode k, and compute empirical transition function P k and confidence bounds ϵk. Eliminate inconsistent scopes (Algorithm 2), and construct optimistic MDP f M k. Compute optimal policy πk of f M k using oracle, and extract optimistic policy πk. Execute policy πk until there are scopes Z = Z of size m and v X[Z Z ] such that the number of visits to some state-action pairs x with x[Z Z ] = v, is doubled. end for Algorithm 2 Eliminate Inconsistent Scopes Sketch for i = 1, . . . , d and Z e Zk 1 i do for Z {1, . . . , n} of size m and v X[Z Z ] and w Si do if | P k i,Z Z (w | v) P k i,Z(w | v[Z])| > 2 ϵk i,Z Z (w | v) then Eliminate inconsistent scope: e Zk i e Zk i \ {Z}, and BREAK. end if end for end for # Inconsistent reward scopes are eliminated from e Rk j similarly, for every j = 1, . . . , ℓ.
Open Source Code No The paper does not provide a direct link to open-source code for the described methodology or state that it is publicly released.
Open Datasets Yes We test our algorithm on the Sys Admin domain [Guestrin et al., 2003] a network of servers connected by some topology, where failing servers affect the probability of their neighbors to fail and the admin chooses which server to reboot at each time step.
Dataset Splits No The paper mentions using the "Sys Admin domain" but does not specify any training, validation, or test dataset splits or percentages.
Hardware Specification No The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies No The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions, or specific solvers/libraries).
Experiment Setup No The paper mentions experiments in Section 7 and refers to Appendix G for "implementation details", but the provided text does not contain specific hyperparameters, training configurations, or system-level settings.