Human-AI Collaboration with Bandit Feedback
Authors: Ruijiang Gao, Maytal Saar-Tsechansky, Maria De-Arteaga, Ligong Han, Min Kyung Lee, Matthew Lease
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our proposed methods using both synthetic and real human responses, and find that our methods outperform both the algorithm and the human when they each make decisions on their own. We empirically demonstrate the performance of the proposed solutions for HAI-BLBF on multi-label datasets converted to reflect decisions and outcomes, and using both synthetic and real human labels. We investigate the limitations of our proposed hybrid team through ablation studies on model capacities... |
| Researcher Affiliation | Academia | ¹University of Texas at Austin, ²Rutgers University; {ruijiang, ml}@utexas.edu, maytal@mail.utexas.edu, dearteaga@mccombs.utexas.edu, lh599@scarletmail.rutgers.edu, minkyung.lee@austin.utexas.edu |
| Pseudocode | No | The paper describes its methods and objectives using mathematical formulations and textual descriptions, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and Appendix is available at https://github.com/ruijiang81/hai-blbf |
| Open Datasets | Yes | We show our experimental results for simulated human decision models on two multi-label datasets, Scene and TMC from LIBSVM repository [Elisseeff and Weston, 2002; Boutell et al., 2004], which are used for semantic scene and text classification. We evaluate our approach with real human decisions, using the data used in [Li et al., 2018] for Multi Label Learning (MLC) and a sentiment analysis dataset (Focus) [Rzhetsky et al., 2009] from crowd workers. |
| Dataset Splits | No | The paper mentions using a “test set ratio as 15%” and performing “grid search on the training data” but does not explicitly state a separate validation set split or its size. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory specifications, or cloud resources used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as “three-layer neural network”, “Adam”, and “random forest (default implementation in scikit-learn package)” but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all of the main experiments, the policy model is a three-layer neural network and the router model is another three-layer neural network. We use Adam [Kingma and Ba, 2014] with a learning rate of 0.001 for optimization, and both the policy and router models are deterministic at test time. We use a truncated importance sampling estimator with the truncation threshold set to 10, and we select the baseline in [Joachims et al., 2018] through a grid search over [0, 0.2, 0.4, 0.6, 0.8] on the training data. The logging probabilities are estimated by an additional random forest model trained on observational data. We train each method for enough epochs to reach convergence. In the main results, we set the human decision cost at C(x) = C = 0.3, and we later vary this cost in the ablation studies. Each experiment is run over ten repetitions and we report the average reward and standard error in Table 1. (Sketches of the propensity-estimation step and the truncated estimator follow the table.) |
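
The Experiment Setup row states that the logging probabilities are estimated by a random forest (scikit-learn defaults) trained on the observational data. Below is a minimal sketch of that step; the data shapes and variable names are illustrative assumptions, not the authors' code (see https://github.com/ruijiang81/hai-blbf for the actual implementation).

```python
# Hedged sketch of the propensity-estimation step described in the paper:
# a default scikit-learn random forest fit on logged (context, action) pairs.
# All data here is synthetic and the shapes are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_log = rng.normal(size=(1000, 20))      # logged contexts (assumed dimensionality)
a_log = rng.integers(0, 5, size=1000)    # actions taken by the logging policy

rf = RandomForestClassifier()            # default hyperparameters, per the paper
rf.fit(X_log, a_log)

# Probability the logging policy assigned to the action it actually took.
# Column j of predict_proba corresponds to rf.classes_[j]; with actions
# 0..4 all present in the data, the columns line up with the action ids.
all_probs = rf.predict_proba(X_log)
logging_prob = all_probs[np.arange(len(a_log)), a_log]
```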
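
Those estimated propensities feed the truncated (clipped) importance sampling objective with a subtracted baseline, selected by grid search following Joachims et al. [2018]. A sketch under stated assumptions follows: the network widths, batch construction, and helper names are ours for illustration, and the sketch shows only the policy loss, not the router or the human-cost term.

```python
# Minimal sketch (not the authors' implementation) of training a three-layer
# policy network with a clipped importance sampling estimator and a baseline.
import torch
import torch.nn as nn

def make_mlp(in_dim, hidden, out_dim):
    # Three-layer network, matching the description in the Experiment Setup row.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

def clipped_ips_loss(policy, x, logged_action, logged_reward, logging_prob,
                     clip=10.0, baseline=0.4):
    """Truncated importance sampling with a subtracted baseline.

    `clip`=10 and the baseline grid [0, 0.2, 0.4, 0.6, 0.8] follow the
    settings reported in the paper; logging_prob comes from the separate
    random-forest propensity model.
    """
    log_probs = torch.log_softmax(policy(x), dim=-1)
    pi = log_probs.gather(1, logged_action.unsqueeze(1)).squeeze(1).exp()
    weights = torch.clamp(pi / logging_prob, max=clip)  # truncation at 10
    # Negative sign: gradient descent then maximizes the estimated reward.
    return -((logged_reward - baseline) * weights).mean()

# Hypothetical usage with d features and k actions on one synthetic batch.
d, k, hidden = 20, 5, 64
policy = make_mlp(d, hidden, k)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)  # learning rate from the paper

x = torch.randn(32, d)
a = torch.randint(0, k, (32,))
r = torch.rand(32)
p0 = torch.full((32,), 1.0 / k)  # stand-in for the random-forest propensities

loss = clipped_ips_loss(policy, x, a, r, p0)
opt.zero_grad()
loss.backward()
opt.step()
```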