Off-Policy Evaluation for Human Feedback
Authors: Qitong Gao, Ge Gao, Juncheng Dong, Vahid Tarokh, Min Chi, Miroslav Pajic
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach has been tested in two real-world experiments, adaptive in-vivo neurostimulation and intelligent tutoring, as well as in a simulation environment (visual Q&A). Results show that our approach significantly improves performance in estimating HF signals accurately, compared to directly applying (variants of) existing OPE methods. |
| Researcher Affiliation | Academia | Duke University, Durham, NC, USA. Contact: {qitong.gao, miroslav.pajic}@duke.edu. North Carolina State University, Raleigh, NC, USA. |
| Pseudocode | No | The paper does not contain any blocks labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code or an algorithm. |
| Open Source Code | Yes | Code implementations for all three components above can be found in the supplementary materials attached. |
| Open Datasets | Yes | We have also tested our method within a visual Q&A environment [10, 66], which follows similar mechanisms as in the two real-world experiments, i.e., two types of return signals are considered though no human participants are involved. |
| Dataset Splits | No | The paper describes the collection of 'offline trajectories' and 'target policies' for evaluation, and also mentions standard OPE methods that use 'train', 'validation', and 'test' in their names. However, it does not provide specific percentages or absolute counts for dataset splits like '80/10/10 split' for its own experiments. |
| Hardware Specification | Yes | All experimental workloads are distributed among 4 Nvidia RTX A5000 24GB and 3 Nvidia Quadro RTX 6000 24GB graphics cards. |
| Software Dependencies | No | The paper mentions that 'The implementations of downstream OPE estimators are built on top of [36]' and details the use of LSTMs and Adam optimizer. However, it does not provide specific version numbers for these or other key software components and libraries (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | The scale of this regularization is selected from C = {1e-04, 5e-03, 1e-03, 5e-02, 1e-02, 0.1, 1., 2., 5.}. The learning rate is tuned by grid search over {0.003, 0.001, 0.0007, 0.0005, 0.0003, 0.0001, 0.00005}. Exponential decay is applied to the learning rate, which decays it by 0.997 every iteration. The total number of training epochs is set to 20 and the minibatch size to 64. The Adam optimizer is used to perform gradient descent. L2 weight decay with coefficient 0.001 and batch normalization are applied to all hidden fully connected layers. |
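
The Experiment Setup row above fully specifies the optimizer and schedule but not the framework. Below is a minimal sketch of that training configuration, assuming PyTorch; the small fully connected model, synthetic data, and hyperparameter-grid names are placeholders standing in for the paper's LSTM-based components and offline trajectories, not the authors' implementation.

```python
# Hedged sketch of the reported training setup: Adam, learning rate from a grid,
# exponential decay of 0.997 per iteration, 20 epochs, minibatch size 64,
# L2 weight decay 0.001, batch normalization on hidden fully connected layers.
# Framework (PyTorch), model architecture, and data are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

LEARNING_RATE_GRID = [3e-3, 1e-3, 7e-4, 5e-4, 3e-4, 1e-4, 5e-5]   # grid from the paper
REG_SCALE_GRID = [1e-4, 5e-3, 1e-3, 5e-2, 1e-2, 0.1, 1.0, 2.0, 5.0]  # set C from the paper


def build_model(in_dim: int, hidden: int = 64, out_dim: int = 1) -> nn.Module:
    # Hidden fully connected layers followed by batch normalization, as reported;
    # the layer sizes and depth here are illustrative placeholders.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


def train(model: nn.Module, loader: DataLoader, lr: float, epochs: int = 20) -> nn.Module:
    # Adam with L2 weight decay 0.001. Note: PyTorch's weight_decay penalizes all
    # parameters, whereas the paper applies it to hidden fully connected layers;
    # this is a simplification in the sketch.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-3)
    # Exponential decay of the learning rate by a factor of 0.997 every iteration.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.997)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            scheduler.step()  # stepped per iteration, not per epoch
    return model


if __name__ == "__main__":
    # Synthetic stand-in data; the paper's offline trajectories are not reproduced here.
    x, y = torch.randn(1024, 8), torch.randn(1024, 1)
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)
    model = train(build_model(in_dim=8), loader, lr=LEARNING_RATE_GRID[1])
```

The scheduler is deliberately stepped inside the minibatch loop rather than once per epoch, matching the statement that the learning rate decays every iteration.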