Provably Efficient Interactive-Grounded Learning with Personalized Reward
Authors: Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms., In Section 4, we implement both algorithms and apply them to both an image classification dataset and a conversational dataset. The empirical performance showcases the effectiveness of our algorithm and the importance of using the Lipschitz reward estimator., We first conduct experiments on the MNIST dataset and implement both the off-policy algorithm Algorithm 1 and the on-policy algorithm Algorithm 2. |
| Researcher Affiliation | Collaboration | Mengxiao Zhang University of Iowa mengxiao-zhang@uiowa.edu, Yuheng Zhang University of Illinois Urbana-Champaign yuhengz2@illinois.edu, Haipeng Luo University of Southern California haipengl@usc.edu, Paul Mineiro Microsoft Research pmineiro@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Off-Policy IGL, Algorithm 2 On-policy IGL |
| Open Source Code | No | The paper does not explicitly state that its code is open-source or provide a direct link to its own repository. It references third-party libraries but not its own implementation code. |
| Open Datasets | Yes | We first conduct experiments on the MNIST dataset and implement both the off-policy algorithm Algorithm 1 and the on-policy algorithm Algorithm 2., Dataset Construction: Our dataset is constructed as follows. Specifically, we construct our question set S = {q_i}_{i ∈ [20000]} from the Chatbot Arena dataset [Zheng et al., 2024]. |
| Dataset Splits | No | The paper mentions a test set but does not explicitly provide details about a validation set split for its experiments. |
| Hardware Specification | Yes | We run the experiments on one NVIDIA GeForce RTX 2080 Ti., We successfully learn ĥ using one A100 GPU within 6 hours., This process is done on one A100 GPU within 3 hours. |
| Software Dependencies | No | The paper mentions frameworks like PyTorch, Llama3-8B-Instruct, and PEFT, but does not consistently provide specific version numbers for these software dependencies, only general citations or access dates. |
| Experiment Setup | Yes | For both algorithms, we set the number of exploration samples N = 5000 and pick the parameters σ ∈ {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3} and θ − α ∈ {1/2}., For the on-policy algorithm, we use a time-varying exploration parameter γ_t = √(Kt) as suggested by Foster and Krishnamurthy [2021]., After obtaining ĥ, we construct the reward estimator G(ĥ_a(x, y), θ − ασ, σ) with σ = 0.1, α = K/2, and θ = 1. |
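
The Experiment Setup row references a Lipschitz reward estimator G and a time-varying exploration parameter γ_t. The minimal Python sketch below illustrates one plausible reading of those two ingredients; the sigmoid form of G, the `threshold` argument, the √(Kt) schedule without constants, and all function names are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np


def lipschitz_reward_estimator(h_hat_a, threshold, sigma):
    """Hypothetical Lipschitz reward estimator.

    Smooths a hard threshold on the learned decoder output h_hat_a into a
    sigmoid whose slope is controlled by sigma, so the estimate varies
    Lipschitz-continuously instead of jumping.  The exact functional form G
    used in the paper is not reproduced here; this is an illustrative stand-in.
    """
    scores = np.asarray(h_hat_a, dtype=float)
    if sigma == 0:
        # sigma = 0 degenerates to a hard (non-Lipschitz) threshold estimator.
        return (scores >= threshold).astype(float)
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / sigma))


def exploration_parameter(t, num_actions):
    """Time-varying exploration parameter of order sqrt(K * t), in the spirit
    of Foster and Krishnamurthy [2021]; problem-dependent constants omitted."""
    return np.sqrt(num_actions * t)


if __name__ == "__main__":
    # Sweep the sigma grid reported in the setup row on synthetic decoder scores.
    rng = np.random.default_rng(0)
    decoder_scores = rng.uniform(0.0, 1.0, size=5000)  # stand-in for h_hat_a(x, y)
    for sigma in [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]:
        rewards = lipschitz_reward_estimator(decoder_scores, threshold=0.5, sigma=sigma)
        print(f"sigma={sigma:<5} mean estimated reward={rewards.mean():.3f}")
```

In this sketch, σ = 0 falls back to a hard threshold, which is one way to read why σ = 0 appears in the sweep alongside the smoothed values that the paper reports as important for performance.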