Model-Based Relational RL When Object Existence is Partially Observable

Authors: Ngo Anh Vien, Marc Toussaint

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We prove that the learned belief update rules encode an approximation of the exact belief updates of a POMDP formulation and demonstrate experimentally that the proposed approach successfully learns a set of relational rules appropriate to solve such problems."
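For context, the exact quantity these learned rules approximate is the standard discrete POMDP belief update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). A minimal sketch in Python; the function and array names are illustrative, not taken from the paper:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact discrete POMDP belief update (Bayes filter).

    b : (S,)      prior belief over states
    T : (A, S, S) transition model, T[a, s, s2] = P(s2 | s, a)
    O : (A, S, Z) observation model, O[a, s2, o] = P(o | s2, a)
    Returns b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s).
    """
    predicted = T[a].T @ b              # prediction step: sum over prior states
    posterior = O[a][:, o] * predicted  # correction step: weight by observation likelihood
    return posterior / posterior.sum()  # renormalize to a valid distribution
```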
Researcher Affiliation | Academia | Ngo Anh Vien (VIEN.NGO@IPVS.UNI-STUTTGART.DE) and Marc Toussaint (MARC.TOUSSAINT@IPVS.UNI-STUTTGART.DE), both of the Machine Learning and Robotics Lab, University of Stuttgart, 70569 Stuttgart, Germany.
Pseudocode | Yes | Algorithm 1: Belief Augmentation Algorithm
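The pseudocode itself is not reproduced on this page. As a rough illustration of the idea behind belief augmentation (extending the belief with hypothetical objects whose existence is uncertain, so that learned relational rules can refer to them), here is a hedged Python sketch; it is not the paper's Algorithm 1, and all names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    objects: set = field(default_factory=set)        # objects known to exist
    exists_prob: dict = field(default_factory=dict)  # hypothetical object -> P(exists)

def augment_belief(belief, candidate_objects, prior=0.5):
    """Add hypothetical objects with an existence probability.

    Sketch only: mimics the idea of augmenting a belief with
    possibly-existing objects; the paper's actual Algorithm 1 is
    defined over relational states and learned rules.
    """
    for obj in candidate_objects:
        if obj not in belief.objects and obj not in belief.exists_prob:
            belief.exists_prob[obj] = prior  # assumed uninformative prior
    return belief
```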
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper mentions generating training data ('20 training examples generated from the true model', 'generate training data of variable size', 'a training set of 200 experience triples'), but it does not refer to any pre-existing public dataset with concrete access information (e.g., a link, DOI, or formal citation).
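Although no public dataset is released, the training data the paper describes are simulator-generated experience triples. A hypothetical sketch of such a collection loop; the `simulator` and `policy` interfaces below are assumptions, not the paper's API:

```python
def collect_experience(simulator, policy, n_triples=200):
    """Collect (state, action, next_state) experience triples.

    Illustrative only: the paper reports, e.g., 'a training set of 200
    experience triples' generated with the simulator of Lang & Toussaint
    (2010); the interfaces used here are hypothetical.
    """
    data = []
    state = simulator.reset()
    for _ in range(n_triples):
        action = policy(state)              # pick an action in the current state
        next_state = simulator.step(action) # advance the simulator
        data.append((state, action, next_state))
        state = next_state
    return data
```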
Dataset Splits | No | The paper mentions 'training examples' and a 'test data set' but does not specify a separate validation set or give details of train/validation/test splits.
Hardware Specification | No | The paper mentions using the simulator of Lang & Toussaint (2010), which uses the ODE physics engine internally, but gives no details of the hardware (e.g., CPU/GPU models, memory) used to run the simulations or experiments.
Software Dependencies | No | The paper names several software components and algorithms, such as ODE, SARSOP (Kurniawati et al., 2008), UCT (Kocsis & Szepesvári, 2006), and PRADA (Lang & Toussaint, 2010), but provides no version numbers for any of them, which hinders reproducibility.
Experiment Setup | Yes | UPRL+P uses a horizon d = 4 and N = 200 sampled action sequences in PRADA. SST uses horizon d = 3 and branching factor b = 2; the authors report trying larger d and b, but the simulation did not finish after two days. UCT uses horizon d = 4, bias parameter c = 1.0 (the best choice among those tested experimentally), and N = 200 samples. All three algorithms use a discount factor γ = 0.95.
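For anyone attempting a reproduction, the reported planner settings can be collected in one place; the dictionary layout below is our own, but the values are those stated in the paper. For UCT, the bias parameter c is the exploration constant in the usual UCB rule Q(s, a) + c · sqrt(ln N(s) / N(s, a)) of Kocsis & Szepesvári (2006).

```python
# Reported hyperparameters (values from the paper; the structure is illustrative).
PLANNER_CONFIG = {
    "UPRL+P (PRADA)": {"horizon_d": 4, "num_samples_N": 200},
    "SST":            {"horizon_d": 3, "branching_b": 2},
    "UCT":            {"horizon_d": 4, "bias_c": 1.0, "num_samples_N": 200},
}
DISCOUNT_GAMMA = 0.95  # shared by all three planners
```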