Off-Policy Evaluation for Human Feedback
Authors: Qitong Gao, Ge Gao, Juncheng Dong, Vahid Tarokh, Min Chi, Miroslav Pajic
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach has been tested in two real-world experiments, adaptive in-vivo neurostimulation and intelligent tutoring, as well as in a simulation environment (visual Q&A). Results show that our approach significantly improves performance in estimating HF signals accurately, compared to directly applying (variants of) existing OPE methods. |
| Researcher Affiliation | Academia | Duke University, Durham, NC, USA. Contact: {qitong.gao, miroslav.pajic}@duke.edu. North Carolina State University, Raleigh, NC, USA. |
| Pseudocode | No | The paper does not contain any blocks labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code or an algorithm. |
| Open Source Code | Yes | Code implementations for all three components above can be found in the supplementary materials attached. |
| Open Datasets | Yes | We have also tested our method within a visual Q&A environment [10, 66], which follows similar mechanisms as in the two real-world experiments, i.e., two types of return signals are considered though no human participants are involved. |
| Dataset Splits | No | The paper describes the collection of 'offline trajectories' and 'target policies' for evaluation, and also mentions standard OPE methods that use 'train', 'validation', and 'test' in their names. However, it does not provide specific percentages or absolute counts for dataset splits like '80/10/10 split' for its own experiments. |
| Hardware Specification | Yes | All experimental workloads are distributed among 4 Nvidia RTX A5000 24GB and 3 Nvidia Quadro RTX 6000 24GB graphics cards. |
| Software Dependencies | No | The paper mentions that 'The implementations of downstream OPE estimators are built on top of [36]' and details the use of LSTMs and Adam optimizer. However, it does not provide specific version numbers for these or other key software components and libraries (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | The scale of this regularization is selected from C = {1e-04, 5e-03, 1e-03, 5e-02, 1e-02, 0.1, 1., 2., 5.}. The learning rate is tuned by grid search over {0.003, 0.001, 0.0007, 0.0005, 0.0003, 0.0001, 0.00005}. Exponential decay is applied to the learning rate, which decays it by 0.997 every iteration. The total number of training epochs is set to 20 and the minibatch size to 64. The Adam optimizer is used to perform gradient descent. L2 weight decay with coefficient 0.001 and batch normalization are applied to all hidden fully connected layers. |
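
The Experiment Setup row above fully specifies the optimizer and schedule but not the framework. Below is a minimal sketch of that training configuration, assuming PyTorch; the small fully connected model, synthetic data, and hyperparameter-grid names are placeholders standing in for the paper's LSTM-based components and offline trajectories, not the authors' implementation.

```python
# Hedged sketch of the reported training setup: Adam, learning rate from a grid,
# exponential decay of 0.997 per iteration, 20 epochs, minibatch size 64,
# L2 weight decay 0.001, batch normalization on hidden fully connected layers.
# Framework (PyTorch), model architecture, and data are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

LEARNING_RATE_GRID = [3e-3, 1e-3, 7e-4, 5e-4, 3e-4, 1e-4, 5e-5]   # grid from the paper
REG_SCALE_GRID = [1e-4, 5e-3, 1e-3, 5e-2, 1e-2, 0.1, 1.0, 2.0, 5.0]  # set C from the paper


def build_model(in_dim: int, hidden: int = 64, out_dim: int = 1) -> nn.Module:
    # Hidden fully connected layers followed by batch normalization, as reported;
    # the layer sizes and depth here are illustrative placeholders.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


def train(model: nn.Module, loader: DataLoader, lr: float, epochs: int = 20) -> nn.Module:
    # Adam with L2 weight decay 0.001. Note: PyTorch's weight_decay penalizes all
    # parameters, whereas the paper applies it to hidden fully connected layers;
    # this is a simplification in the sketch.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-3)
    # Exponential decay of the learning rate by a factor of 0.997 every iteration.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.997)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            scheduler.step()  # stepped per iteration, not per epoch
    return model


if __name__ == "__main__":
    # Synthetic stand-in data; the paper's offline trajectories are not reproduced here.
    x, y = torch.randn(1024, 8), torch.randn(1024, 1)
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)
    model = train(build_model(in_dim=8), loader, lr=LEARNING_RATE_GRID[1])
```

The scheduler is deliberately stepped inside the minibatch loop rather than once per epoch, matching the statement that the learning rate decays every iteration.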