"How hard is my MDP?" The distribution-norm to the rescue
Authors: Odalric-Ambrym Maillard, Timothy A Mann, Shie Mannor
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Table 1: MDPs marked with a indicate that the true MDP was not available and so it was estimated from samples. We estimated these MDPs with 10,000 samples from each state-action pair. MDPs marked with a ' indicate that the original MDP is deterministic and therefore we added noise to the transition dynamics. For the Mountain Car problem, we added a small amount of noise to the vehicle's velocity during each step ($\text{pos}_{t+1} = \text{pos}_t + \text{vel}_t (1 + X)$, where $X$ is a random variable with equally probable events $\{-\text{vel}_{\text{MAX}}, 0, \text{vel}_{\text{MAX}}\}$). For the pinball domain we added noise similar to Tamar et al. (2013). MDPs marked with a were discretized to create a finite state MDP. The rewards of all MDPs were normalized to [0, 1] and discount factor γ = 0.95 was used. Figure 1: Comparison of the Weissman et al. (2003) bound $V_{\text{MAX}}$ to the bound given by Theorem 1 $C^\pi_M$ in the benchmark MDPs. |
| Researcher Affiliation | Academia | Odalric-Ambrym Maillard The Technion, Haifa, Israel odalric-ambrym.maillard@ens-cachan.org Timothy A. Mann The Technion, Haifa, Israel mann.timothy@gmail.com Shie Mannor The Technion, Haifa, Israel shie@ee.technion.ac.il |
| Pseudocode | No | The paper discusses algorithmic modifications but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is available. |
| Open Datasets | Yes | Table 1 refers to and cites well-known benchmark MDPs from prior literature, such as 'bottleneck McGovern and Barto (2001)', 'red herring Hester and Stone (2009)', 'taxi Dietterich (1998)', 'inventory Mankowitz et al. (2014)', 'mountain car Sutton and Barto (1998)', and 'pinball Konidaris and Barto (2009)'. |
| Dataset Splits | No | The paper describes how some MDPs were estimated from samples ('estimated these MDPs with 10,000 samples from each state-action pair') or how noise was added to existing MDPs, but it does not specify the train/validation/test dataset splits typically used for training and evaluating machine learning models. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list any specific software dependencies or their version numbers that would be necessary for reproducing the experiments. |
| Experiment Setup | Yes | In Section 3.2, 'The hardness of benchmarks MDPs', the paper states: 'The rewards of all MDPs were normalized to [0, 1] and discount factor γ = 0.95 was used.' It also mentions: 'We estimated these MDPs with 10,000 samples from each state-action pair.' It describes specific noise additions for Mountain Car and Pinball, and states 'we ran policy iteration on each of the benchmark MDPs from Table 3.2 for 100 iterations'. |
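
The setup quoted in the table (noise on the Mountain Car step, MDPs estimated from samples of each state-action pair, rewards normalized to [0, 1], γ = 0.95, policy iteration for 100 iterations) can be sketched roughly as follows. This is a minimal illustration only, not the authors' code, which the paper does not provide. All function names are hypothetical, the tabular list-of-lists representation is an assumption, and the policy evaluation step is approximated by repeated Bellman backups with an early stop once the policy is stable; neither choice is stated in the paper.

```python
import random

GAMMA = 0.95  # discount factor quoted from the paper


def noisy_position_step(pos, vel, vel_max, rng):
    """One noisy step: pos_{t+1} = pos_t + vel_t * (1 + X),
    with X drawn uniformly from {-vel_max, 0, vel_max}."""
    x = rng.choice([-vel_max, 0.0, vel_max])
    return pos + vel * (1.0 + x)


def estimate_transitions(sample_next_state, n_states, n_actions, n_samples, rng):
    """Empirical transition table P[s][a][s'] from n_samples draws per
    state-action pair. sample_next_state(s, a, rng) is a hypothetical
    simulator callback returning the index of the next state."""
    P = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a, rng)] += 1
            row.append([c / n_samples for c in counts])
        P.append(row)
    return P


def normalize_rewards(R):
    """Affinely rescale the reward table R[s][a] to the interval [0, 1]."""
    flat = [r for row in R for r in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in R]
    return [[(r - lo) / (hi - lo) for r in row] for row in R]


def q_value(P, R, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def policy_iteration(P, R, gamma=GAMMA, n_iters=100, eval_sweeps=200):
    """Policy iteration with approximate (iterative) policy evaluation.
    Stops early if the greedy policy no longer changes (an assumption)."""
    n_states, n_actions = len(R), len(R[0])
    policy = [0] * n_states
    V = [0.0] * n_states
    for _ in range(n_iters):
        for _ in range(eval_sweeps):
            V = [q_value(P, R, V, s, policy[s], gamma) for s in range(n_states)]
        new_policy = [
            max(range(n_actions), key=lambda a: q_value(P, R, V, s, a, gamma))
            for s in range(n_states)
        ]
        if new_policy == policy:
            break
        policy = new_policy
    return policy, V
```

As a usage sketch, `estimate_transitions` could be called with `n_samples=10000` to mirror the sample count quoted above, and its output passed to `policy_iteration` together with `normalize_rewards(R)`. With rewards in [0, 1] and γ = 0.95, every value returned is bounded by 1 / (1 − γ) = 20.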