Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Authors: Kenneth Li, Samy Jelassi, Hugh Zhang, Sham M. Kakade, Martin Wattenberg, David Brandfonbrener

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we evaluate Q-probes with access to ground truth rewards on coding benchmarks. Our best Q-probe achieves 17% higher accuracy on MBPP (Austin et al., 2021) compared to the base Code-LLaMA-7B (Roziere et al., 2023) and outperforms finetuning on successes with LoRA (Hu et al., 2021) and few-shot prompting. Again, we emphasize that Q-probes are not mutually exclusive with these other techniques and can be combined for even better results. One key component of the results is a novel objective for training the Q-probes via direct policy learning. (A hedged selection sketch follows the table.)
Researcher Affiliation | Academia | 1 John A. Paulson School of Engineering and Applied Sciences, Harvard University; 2 Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; 3 Center of Mathematical Sciences and Applications, Harvard University. Correspondence to: Kenneth Li <ke_li@g.harvard.edu>, David Brandfonbrener <david_brandfonbrener@g.harvard.edu>.
Pseudocode | No | The paper describes the Q-probe policy and training algorithms in paragraph form and through mathematical equations, but it does not include any explicitly labeled “Pseudocode” or “Algorithm” blocks or figures.
Open Source Code | Yes | Code: https://github.com/likenneth/q_probe.
Open Datasets | Yes | We train models on the MBPP train set which consists of 464 programming prompts with test cases. We consider the reward to be 1 if all tests are passed and 0 otherwise. For each training prompt, we can generate as many completions as we want from the base model to automatically label with these rewards. We sample from the base model with temperature 0.8 and top-p 0.95, following (Roziere et al., 2023), unless otherwise noted. (A hedged reward-labeling sketch follows the table.)
Dataset Splits | No | The paper specifies training and test sets (e.g., MBPP train set, MBPP test set), but it does not explicitly mention or detail a separate validation set or its split percentage/size for reproducing experiments.
Hardware Specification | No | The paper mentions “Kempner Institute computing resources enabled this work” but does not provide specific details about the hardware used (e.g., GPU models, CPU types, memory, or cloud instance specifications) for running the experiments.
Software Dependencies | No | The paper mentions “the optimizer is Adam” but does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | Unless otherwise stated, the Q-probe is a 1-layer (linear) probe, the optimizer is Adam (Kingma and Ba, 2014), the learning rate is 5e-5, the batch size is 1000, and we train for 150 epochs. For the PG loss, we need multiple samples from one prompt to compute the loss. To do this, we group samples by prompt and reshape the batch so it contains 100 problems with 10 samples from each problem. (A hedged training sketch follows the table.)
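
The Research Type row summarizes the core mechanism: draw several completions from the base model, score each with a small probe on the model's embeddings, and return one chosen by a softmax over the probe values. The sketch below is a minimal illustration of that selection step, not the authors' code: `sample_fn`, `embed_fn`, and `probe_w` are hypothetical stand-ins for the base-model sampler, the embedding extractor, and a trained linear probe, and the values of `k` and `beta` are placeholders.

```python
import torch

def q_probe_select(prompt, sample_fn, embed_fn, probe_w, k=10, beta=0.1):
    """Sketch of Q-probe selection: draw k completions, score each with a
    linear probe on the base model's embedding, and return one drawn from a
    softmax over the scores (argmax in the limit beta -> 0).

    sample_fn(prompt) -> str and embed_fn(prompt, completion) -> (d,) tensor
    are hypothetical wrappers around the frozen base model; probe_w is a
    trained (d,) probe weight vector.
    """
    completions = [sample_fn(prompt) for _ in range(k)]              # k candidate programs
    embs = torch.stack([embed_fn(prompt, c) for c in completions])   # (k, d) embeddings
    q = embs @ probe_w                                               # (k,) probe values
    idx = torch.multinomial(torch.softmax(q / beta, dim=0), 1).item()
    return completions[idx]
```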
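
The Open Datasets row defines the reward: 1 if a generated program passes all of a prompt's test cases, 0 otherwise. Below is a hedged sketch of that labeling step, assuming the MBPP tests are plain `assert` statements; running model-generated code with `exec` like this is unsafe outside a sandbox.

```python
def mbpp_reward(program: str, test_cases: list[str]) -> float:
    """Return 1.0 if the candidate program passes every test, else 0.0.
    Illustrative only: execute untrusted generations in an isolated sandbox."""
    namespace: dict = {}
    try:
        exec(program, namespace)        # define the candidate solution
        for test in test_cases:         # each test is an assert statement
            exec(test, namespace)
        return 1.0
    except Exception:
        return 0.0
```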
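
The Experiment Setup row gives the training recipe: a linear probe, Adam with learning rate 5e-5, 150 epochs, and batches of 1000 examples reshaped into 100 prompts with 10 samples each for the PG loss. The sketch below follows that recipe under my reading of the PG objective, maximizing the softmax-weighted reward across each prompt's samples; the `beta` temperature and the precomputed `embeddings`/`rewards` tensors are assumptions, not taken from the released code.

```python
import torch

def train_q_probe(embeddings, rewards, dim, epochs=150, lr=5e-5,
                  prompts_per_batch=100, samples_per_prompt=10, beta=0.1):
    """Sketch of Q-probe training with a PG-style loss.

    embeddings: (N, dim) base-model embeddings, N assumed a multiple of 1000
    rewards:    (N,) binary rewards aligned with the embeddings
    """
    probe = torch.nn.Linear(dim, 1, bias=False)      # 1-layer (linear) probe
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    embs = embeddings.view(-1, prompts_per_batch, samples_per_prompt, dim)
    rews = rewards.view(-1, prompts_per_batch, samples_per_prompt)
    for _ in range(epochs):
        for e, r in zip(embs, rews):                 # one batch: 100 prompts x 10 samples
            q = probe(e).squeeze(-1)                 # (100, 10) probe values
            pi = torch.softmax(q / beta, dim=-1)     # softmax policy over each prompt's samples
            loss = -(pi * r).sum(dim=-1).mean()      # maximize expected reward per prompt
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

A probe trained this way would plug into the selection sketch above as `probe_w = probe.weight.detach().squeeze(0)`.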