Model-Based Relational RL When Object Existence is Partially Observable
Authors: Ngo Anh Vien, Marc Toussaint
ICML 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that the learned belief update rules encode an approximation of the exact belief updates of a POMDP formulation and demonstrate experimentally that the proposed approach successfully learns a set of relational rules appropriate to solve such problems. |
| Researcher Affiliation | Academia | Ngo Anh Vien VIEN.NGO@IPVS.UNI-STUTTGART.DE Machine Learning and Robotics Lab, University of Stuttgart, 70569 Germany Marc Toussaint MARC.TOUSSAINT@IPVS.UNI-STUTTGART.DE Machine Learning and Robotics Lab, University of Stuttgart, 70569 Germany |
| Pseudocode | Yes | Algorithm 1 Belief Augmentation Algorithm |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper mentions generating training data ('20 training examples generated from the true model', 'generate training data of variable size', 'a training set of 200 experience triples'), but it does not refer to any pre-existing public datasets with concrete access information (e.g., links, DOIs, or formal citations). |
| Dataset Splits | No | The paper mentions 'training examples' and 'test data set' but does not specify a separate validation set or provide details on train/validation/test splits. |
| Hardware Specification | No | The paper mentions using 'the simulator of Lang and Toussaint (Lang & Toussaint, 2010) (using the physics simulation ODE internally)' but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run these simulations or experiments. |
| Software Dependencies | No | The paper mentions several software components and algorithms, such as the physics simulation 'ODE', 'SARSOP (Kurniawati et al., 2008)', 'UCT (Kocsis & Szepesvári, 2006)', and 'PRADA (Lang & Toussaint, 2010)', but it does not provide specific version numbers for any of these to ensure reproducibility. |
| Experiment Setup | Yes | UPRL+P uses a horizon d = 4 and N = 200 sample action sequences in PRADA. The settings of SST are: the horizon d = 3 and the branching factor b = 2. For SST, we tried increasing both d and b, but the simulation did not finish after two days. The settings of UCT are: the horizon d = 4, the bias parameter c = 1.0 (the best choice among those we tested experimentally), and the number of samples N = 200. All three algorithms use a discount factor γ = 0.95. |
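
For clarity, the planner settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is purely illustrative; the key names and structure are hypothetical, since the authors do not release code or any configuration format.

```python
# Hypothetical summary of the planner settings quoted in the paper's experiments.
# Key names are illustrative only; they do not come from the authors' code.
experiment_setup = {
    "UPRL+P": {                      # uses PRADA for planning
        "horizon_d": 4,
        "num_sampled_action_sequences_N": 200,
    },
    "SST": {
        "horizon_d": 3,
        "branching_factor_b": 2,     # larger d or b did not finish within two days
    },
    "UCT": {
        "horizon_d": 4,
        "bias_parameter_c": 1.0,     # best value among those the authors tested
        "num_samples_N": 200,
    },
    "discount_gamma": 0.95,          # shared by all three algorithms
}

if __name__ == "__main__":
    # Print each algorithm's settings for a quick side-by-side comparison.
    for name, cfg in experiment_setup.items():
        print(name, cfg)
```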