Human-AI Collaboration with Bandit Feedback
Authors: Ruijiang Gao, Maytal Saar-Tsechansky, Maria De-Arteaga, Ligong Han, Min Kyung Lee, Matthew Lease
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our proposed methods using both synthetic and real human responses, and find that our methods outperform both the algorithm and the human when they each make decisions on their own. We empirically demonstrate the performance of the proposed solutions for HAI-BLBF on multi-label datasets converted to reflect decisions and outcomes, and using both synthetic and real human labels. We investigate the limitations of our proposed hybrid team through ablation studies on model capacities... |
| Researcher Affiliation | Academia | ¹University of Texas at Austin, ²Rutgers University; {ruijiang, ml}@utexas.edu, maytal@mail.utexas.edu, dearteaga@mccombs.utexas.edu, lh599@scarletmail.rutgers.edu, minkyung.lee@austin.utexas.edu |
| Pseudocode | No | The paper describes its methods and objectives using mathematical formulations and textual descriptions, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and Appendix is available at https://github.com/ruijiang81/hai-blbf |
| Open Datasets | Yes | We show our experimental results for simulated human decision models on two multi-label datasets, Scene and TMC from LIBSVM repository [Elisseeff and Weston, 2002; Boutell et al., 2004], which are used for semantic scene and text classification. We evaluate our approach with real human decisions, using the data used in [Li et al., 2018] for Multi Label Learning (MLC) and a sentiment analysis dataset (Focus) [Rzhetsky et al., 2009] from crowd workers. |
| Dataset Splits | No | The paper mentions using a “test set ratio as 15%” and performing “grid search on the training data” but does not explicitly state a separate validation set split or its size. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory specifications, or cloud resources used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as “three-layer neural network”, “Adam”, and “random forest (default implementation in scikit-learn package)” but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all of the main experiments, the policy model is a three-layer neural network and the router model is another three-layer neural network. We use Adam [Kingma and Ba, 2014] with a learning rate of 0.001 for optimization, and both the policy and router models are deterministic at test time. We use a truncated importance sampling estimator with the truncation threshold set to 10, and we select the baseline in [Joachims et al., 2018] through a grid search over [0, 0.2, 0.4, 0.6, 0.8] on the training data. The logging probabilities are estimated by an additional random forest model trained on observational data. We train each method for enough epochs to reach convergence. In the main results, we set the human decision cost at C(x) = C = 0.3, and we later vary this cost in the ablation studies. Each experiment is run over ten repetitions and we report the average reward and standard error in Table 1. (Sketches of the propensity-estimation step and the truncated estimator follow the table.) |
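
The Experiment Setup row states that the logging probabilities are estimated by a random forest (scikit-learn defaults) trained on the observational data. Below is a minimal sketch of that step; the data shapes and variable names are illustrative assumptions, not the authors' code (see https://github.com/ruijiang81/hai-blbf for the actual implementation).

```python
# Hedged sketch of the propensity-estimation step described in the paper:
# a default scikit-learn random forest fit on logged (context, action) pairs.
# All data here is synthetic and the shapes are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_log = rng.normal(size=(1000, 20))      # logged contexts (assumed dimensionality)
a_log = rng.integers(0, 5, size=1000)    # actions taken by the logging policy

rf = RandomForestClassifier()            # default hyperparameters, per the paper
rf.fit(X_log, a_log)

# Probability the logging policy assigned to the action it actually took.
# Column j of predict_proba corresponds to rf.classes_[j]; with actions
# 0..4 all present in the data, the columns line up with the action ids.
all_probs = rf.predict_proba(X_log)
logging_prob = all_probs[np.arange(len(a_log)), a_log]
```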
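
Those estimated propensities feed the truncated (clipped) importance sampling objective with a subtracted baseline, selected by grid search following Joachims et al. [2018]. A sketch under stated assumptions follows: the network widths, batch construction, and helper names are ours for illustration, and the sketch shows only the policy loss, not the router or the human-cost term.

```python
# Minimal sketch (not the authors' implementation) of training a three-layer
# policy network with a clipped importance sampling estimator and a baseline.
import torch
import torch.nn as nn

def make_mlp(in_dim, hidden, out_dim):
    # Three-layer network, matching the description in the Experiment Setup row.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

def clipped_ips_loss(policy, x, logged_action, logged_reward, logging_prob,
                     clip=10.0, baseline=0.4):
    """Truncated importance sampling with a subtracted baseline.

    `clip`=10 and the baseline grid [0, 0.2, 0.4, 0.6, 0.8] follow the
    settings reported in the paper; logging_prob comes from the separate
    random-forest propensity model.
    """
    log_probs = torch.log_softmax(policy(x), dim=-1)
    pi = log_probs.gather(1, logged_action.unsqueeze(1)).squeeze(1).exp()
    weights = torch.clamp(pi / logging_prob, max=clip)  # truncation at 10
    # Negative sign: gradient descent then maximizes the estimated reward.
    return -((logged_reward - baseline) * weights).mean()

# Hypothetical usage with d features and k actions on one synthetic batch.
d, k, hidden = 20, 5, 64
policy = make_mlp(d, hidden, k)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)  # learning rate from the paper

x = torch.randn(32, d)
a = torch.randint(0, k, (32,))
r = torch.rand(32)
p0 = torch.full((32,), 1.0 / k)  # stand-in for the random-forest propensities

loss = clipped_ips_loss(policy, x, a, r, p0)
opt.zero_grad()
loss.backward()
opt.step()
```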