Model-Free Model Reconciliation
Authors: Sarath Sreedharan, Alberto Olmo Hernandez, Aditya Prasad Mishra, Subbarao Kambhampati
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we will evaluate our method on a set of standard MDP benchmarks and perform user studies to validate its viability (Section 6). We validated our approach on both simulations and on data collected from users. Figure 3 plots the average test accuracy for models trained with training sets of varying sizes. We found the model to have an average 10-fold cross validation score of 0.935. For a randomly generated train and test split (where the test split was 10% and contained around 7% inexplicable labels) the precision score was 0.9637 and the recall score was 0.9568. |
| Researcher Affiliation | Academia | Sarath Sreedharan, Alberto Olmo Hernandez, Aditya Prasad Mishra and Subbarao Kambhampati School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA |
| Pseudocode | No | The paper presents mathematical formulations for functions and optimization problems but does not include any clearly labeled 'Algorithm' or 'Pseudocode' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | For simulations, we used slightly modified versions of the Taxi domain [Dietterich, 1998] (of size 6*6), the Four rooms domain [Sutton et al., 1999] (of size 9*9) and the warehouse scenario (of size 9*9) described before (implemented using the Simple RL framework [Abel, 2019]). For the study, we recruited 45 master turkers from the Amazon Mechanical Turk. Each participant was provided with the URL to a website (https://goo.gl/Hun3ce) where they could view and label various robot behaviors. |
| Dataset Splits | Yes | We found the model to have an average 10-fold cross validation score of 0.935. For a randomly generated train and test split (where the test split was 10% and contained around 7% inexplicable labels) the precision score was 0.9637 and the recall score was 0.9568. In each Warehouse and Four rooms test instance, we collected 900 unique data points as training set and 100 data points as the test set. Due to the complexity of the taxi domain, we generated fewer data points (...) and used close to 220 unique points as training data and on average 28 data points as the test set. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions the 'Simple RL framework' and 'decision tree learner' but does not provide specific version numbers for these or any other software dependencies required to replicate the experiments. |
| Experiment Setup | No | The paper describes the general setup of the simulations and user studies, including data collection and the type of learning model used. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size) or other training configurations. |
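
The Research Type and Dataset Splits rows quote an evaluation protocol: a decision tree learner scored with 10-fold cross-validation and with a random 90/10 train/test split reported via precision and recall. The sketch below is only an illustration of that protocol as it can be inferred from the quoted text; the library (scikit-learn), the feature encoding, the default hyperparameters, and the synthetic data are assumptions, since the paper does not specify them (see the Software Dependencies and Experiment Setup rows).

```python
# Hypothetical reconstruction of the explicability-classifier evaluation quoted above:
# a decision tree trained on labeled behaviors, scored with 10-fold cross-validation
# and a 90/10 train/test split. scikit-learn, features, and hyperparameters are
# assumptions; the paper does not specify them.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import precision_score, recall_score


def evaluate_explicability_classifier(X, y, seed=0):
    """X: feature vectors for observed behaviors; y: 1 = explicable, 0 = inexplicable."""
    clf = DecisionTreeClassifier(random_state=seed)

    # Average 10-fold cross-validation accuracy (the paper reports 0.935).
    cv_accuracy = cross_val_score(clf, X, y, cv=10).mean()

    # Random 90/10 train/test split; the paper reports precision 0.9637, recall 0.9568.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        "cv_accuracy": cv_accuracy,
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }


# Example run with synthetic data standing in for the Warehouse / Four rooms / Taxi
# datasets (e.g., roughly 900 training and 100 test points per Warehouse instance).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((1000, 8))
    y = (X[:, 0] + 0.1 * rng.standard_normal(1000) > 0.5).astype(int)
    print(evaluate_explicability_classifier(X, y))
```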