Provably Efficient Interactive-Grounded Learning with Personalized Reward
Authors: Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms., In Section 4, we implement both algorithms and apply them to both an image classification dataset and a conversational dataset. The empirical performance showcases the effectiveness of our algorithm and the importance of using the Lipschitz reward estimator., We first conduct experiments on the MNIST dataset and implement both the off-policy algorithm Algorithm 1 and the on-policy algorithm Algorithm 2. |
| Researcher Affiliation | Collaboration | Mengxiao Zhang University of Iowa mengxiao-zhang@uiowa.edu, Yuheng Zhang University of Illinois Urbana-Champaign yuhengz2@illinois.edu, Haipeng Luo University of Southern California haipengl@usc.edu, Paul Mineiro Microsoft Research pmineiro@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Off-Policy IGL, Algorithm 2 On-policy IGL |
| Open Source Code | No | The paper does not explicitly state that its code is open-source or provide a direct link to its own repository. It references third-party libraries but not its own implementation code. |
| Open Datasets | Yes | We first conduct experiments on the MNIST dataset and implement both the off-policy algorithm Algorithm 1 and the on-policy algorithm Algorithm 2., Dataset Construction: Our dataset is constructed as follows. Specifically, we construct our question set S = {q_i}_{i ∈ [20000]} from the Chatbot Arena dataset [Zheng et al., 2024]. |
| Dataset Splits | No | The paper mentions a test set but does not explicitly provide details about a validation set split for its experiments. |
| Hardware Specification | Yes | We run the experiments on one NVIDIA GeForce RTX 2080 Ti., We successfully learn ĥ using one A100 GPU within 6 hours., This process is done on one A100 GPU within 3 hours. |
| Software Dependencies | No | The paper mentions frameworks like PyTorch, Llama3-8B-Instruct, and PEFT, but does not consistently provide specific version numbers for these software dependencies, only general citations or access dates. |
| Experiment Setup | Yes | For both algorithms, we set the number of exploration samples N = 5000 and pick the parameters σ ∈ {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3} and θ − α ∈ {1/2}., For the on-policy algorithm, we use a time-varying exploration parameter γ_t = √(Kt) as suggested by Foster and Krishnamurthy [2021]., After obtaining ĥ, we construct the reward estimator G(ĥ_a(x, y), θ − ασ, σ) with σ = 0.1, α = K/2, and θ = 1. |
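
The Experiment Setup row references a Lipschitz reward estimator G and a time-varying exploration parameter γ_t. The minimal Python sketch below illustrates one plausible reading of those two ingredients; the sigmoid form of G, the `threshold` argument, the √(Kt) schedule without constants, and all function names are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np


def lipschitz_reward_estimator(h_hat_a, threshold, sigma):
    """Hypothetical Lipschitz reward estimator.

    Smooths a hard threshold on the learned decoder output h_hat_a into a
    sigmoid whose slope is controlled by sigma, so the estimate varies
    Lipschitz-continuously instead of jumping.  The exact functional form G
    used in the paper is not reproduced here; this is an illustrative stand-in.
    """
    scores = np.asarray(h_hat_a, dtype=float)
    if sigma == 0:
        # sigma = 0 degenerates to a hard (non-Lipschitz) threshold estimator.
        return (scores >= threshold).astype(float)
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / sigma))


def exploration_parameter(t, num_actions):
    """Time-varying exploration parameter of order sqrt(K * t), in the spirit
    of Foster and Krishnamurthy [2021]; problem-dependent constants omitted."""
    return np.sqrt(num_actions * t)


if __name__ == "__main__":
    # Sweep the sigma grid reported in the setup row on synthetic decoder scores.
    rng = np.random.default_rng(0)
    decoder_scores = rng.uniform(0.0, 1.0, size=5000)  # stand-in for h_hat_a(x, y)
    for sigma in [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]:
        rewards = lipschitz_reward_estimator(decoder_scores, threshold=0.5, sigma=sigma)
        print(f"sigma={sigma:<5} mean estimated reward={rewards.mean():.3f}")
```

In this sketch, σ = 0 falls back to a hard threshold, which is one way to read why σ = 0 appears in the sweep alongside the smoothed values that the paper reports as important for performance.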