Model-Free Model Reconciliation
Authors: Sarath Sreedharan, Alberto Olmo Hernandez, Aditya Prasad Mishra, Subbarao Kambhampati
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we will evaluate our method on a set of standard MDP benchmarks and perform user studies to validate its viability (Section 6). We validated our approach on both simulations and on data collected from users. Figure 3 plots the average test accuracy for models trained with training sets of varying sizes. We found the model to have an average 10-fold cross validation score of 0.935. For a randomly generated train and test split (where the test split was 10% and contained around 7% inexplicable labels) the precision score was 0.9637 and the recall score was 0.9568. |
| Researcher Affiliation | Academia | Sarath Sreedharan, Alberto Olmo Hernandez, Aditya Prasad Mishra and Subbarao Kambhampati School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA |
| Pseudocode | No | The paper presents mathematical formulations for functions and optimization problems but does not include any clearly labeled 'Algorithm' or 'Pseudocode' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | For simulations, we used slightly modified versions of the Taxi domain [Dietterich, 1998] (of size 6*6), the Four rooms domain [Sutton et al., 1999] (of size 9*9) and the warehouse scenario (of size 9*9) described before (implemented using the Simple RL framework [Abel, 2019]). For the study, we recruited 45 master turkers from the Amazon Mechanical Turk. Each participant was provided with the URL to a website (https://goo.gl/Hun3ce) where they could view and label various robot behaviors. |
| Dataset Splits | Yes | We found the model to have an average 10-fold cross validation score of 0.935. For a randomly generated train and test split (where the test split was 10% and contained around 7% inexplicable labels) the precision score was 0.9637 and the recall score was 0.9568. In each Warehouse and Four rooms test instance, we collected 900 unique data points as training set and 100 data points as the test set. Due to the complexity of the taxi domain, we generated fewer data points (...) and used close to 220 unique points as training data and on average 28 data points as the test set. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions the 'Simple RL framework' and 'decision tree learner' but does not provide specific version numbers for these or any other software dependencies required to replicate the experiments. |
| Experiment Setup | No | The paper describes the general setup of the simulations and user studies, including data collection and the type of learning model used. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size) or other training configurations. |
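
The Research Type and Dataset Splits rows quote an evaluation protocol: a decision tree learner scored with 10-fold cross-validation and with a random 90/10 train/test split reported via precision and recall. The sketch below is only an illustration of that protocol as it can be inferred from the quoted text; the library (scikit-learn), the feature encoding, the default hyperparameters, and the synthetic data are assumptions, since the paper does not specify them (see the Software Dependencies and Experiment Setup rows).

```python
# Hypothetical reconstruction of the explicability-classifier evaluation quoted above:
# a decision tree trained on labeled behaviors, scored with 10-fold cross-validation
# and a 90/10 train/test split. scikit-learn, features, and hyperparameters are
# assumptions; the paper does not specify them.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import precision_score, recall_score


def evaluate_explicability_classifier(X, y, seed=0):
    """X: feature vectors for observed behaviors; y: 1 = explicable, 0 = inexplicable."""
    clf = DecisionTreeClassifier(random_state=seed)

    # Average 10-fold cross-validation accuracy (the paper reports 0.935).
    cv_accuracy = cross_val_score(clf, X, y, cv=10).mean()

    # Random 90/10 train/test split; the paper reports precision 0.9637, recall 0.9568.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        "cv_accuracy": cv_accuracy,
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }


# Example run with synthetic data standing in for the Warehouse / Four rooms / Taxi
# datasets (e.g., roughly 900 training and 100 test points per Warehouse instance).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((1000, 8))
    y = (X[:, 0] + 0.1 * rng.standard_normal(1000) > 0.5).astype(int)
    print(evaluate_explicability_classifier(X, y))
```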