Stratified Prediction-Powered Inference for Effective Hybrid Evaluation of Language Models
Authors: Adam Fisch, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, William W. Cohen
NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our stratified estimator, Strat PPI, to two baselines: (i) the classical estimate, which uses only the labeled data, $S_n$; and (ii) PPI++, which uses both $S_n$ and $\tilde{S}_N$. All of our experiments focus on 1-d mean estimation. We explore three different allocation strategies for Strat PPI: the first is to set $\rho_k = w_k$ to be data proportional (Strat PPI Prop.), the second is to set $\rho_k$ optimally via the oracle $\rho_k = \rho_k^*$ (Strat PPI Opt.), and the third is to use the approximation, $\rho_k \propto w_k \hat{\sigma}_k$, in Example 2 for $\lambda_k = 1$ when confidence scores are available (Strat PPI Heur.). We use $\lambda$-tuning for both PPI++ and Strat PPI, as outlined in §4.2. Additional experimental results are given in Appendix C. |
| Researcher Affiliation | Industry | Adam Fisch, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, William W. Cohen. Google DeepMind, Google Research. {fisch,joshuahm,rofer,bdhingra,amirg,wcohen}@google.com |
| Pseudocode | Yes | Algorithm 1 Stratified prediction-powered inference for general M-estimators (Strat PPI) |
| Open Source Code | No | Code may be made available at a future date. |
| Open Datasets | Yes | Seahorse. The Seahorse dataset [11] focuses on multilingual summarization. |
| Dataset Splits | Yes | For each experiment, we sample $N = 10{,}000$ total predictions $f(\tilde{X})$ using $\rho_1 = \rho_2 = 0.5$, i.e., proportional to the masses of the two hypothetical, equal-weight strata. We then vary the total number $n$ of labeled examples $Y$, where the allocation is chosen according to $\rho$ (which differs depending on whether we are using Strat PPI Prop. or Strat PPI Opt.). |
| Hardware Specification | No | Compute resources required are very light, as no model training is performed. |
| Software Dependencies | No | The paper does not specify version numbers for any software, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We assume that predictions are formed as $f(X_{ik}) = Y_{ik} + \mu_k + \sigma_k \epsilon_{ik}$, where $\epsilon_{ik} \sim \mathcal{N}(0, 1)$. ... We test three different scenarios: (i) where the two strata are homogeneous with $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$; (ii) where the two strata have different prediction biases, $\mu_1 \neq \mu_2$; and (iii) where the two strata have different prediction noise levels, $\sigma_1 \neq \sigma_2$. |
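
The synthetic setup quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code (none has been released). The label distribution, the particular values of µk and σk, the sample size n = 200, and every function name below are hypothetical. The per-stratum estimate is the standard PPI++ mean estimate with a fixed λ = 1 (no λ-tuning), and the allocation follows the spirit of the ρk ∝ wk σ̂k rule quoted above, with the true σk standing in for the estimate σ̂k.

```python
import random
from statistics import fmean


def simulate_stratum(rng, n_labeled, n_unlabeled, mu, sigma):
    """Draw labeled (Y, f(X)) pairs and unlabeled predictions for one stratum."""
    def draw():
        y = rng.gauss(0.0, 1.0)  # ASSUMPTION: label distribution is not given in the source
        f = y + mu + sigma * rng.gauss(0.0, 1.0)  # f(X) = Y + mu_k + sigma_k * eps
        return y, f

    labeled = [draw() for _ in range(n_labeled)]
    unlabeled_preds = [draw()[1] for _ in range(n_unlabeled)]  # labels are discarded
    return labeled, unlabeled_preds


def ppi_mean(labeled, unlabeled_preds, lam=1.0):
    """PPI++ mean estimate: mean(Y) + lam * (mean f on unlabeled - mean f on labeled)."""
    y_bar = fmean([y for y, _ in labeled])
    f_bar_labeled = fmean([f for _, f in labeled])
    f_bar_unlabeled = fmean(unlabeled_preds)
    return y_bar + lam * (f_bar_unlabeled - f_bar_labeled)


def strat_ppi_mean(strata, weights, lam=1.0):
    """Weighted sum of per-stratum PPI estimates (same lam in every stratum)."""
    return sum(w * ppi_mean(lab, unl, lam) for w, (lab, unl) in zip(weights, strata))


def allocate(n, weights, scales):
    """Split n labels across strata with rho_k proportional to w_k * scale_k."""
    raw = [w * s for w, s in zip(weights, scales)]
    total = sum(raw)
    counts = [round(n * r / total) for r in raw[:-1]]
    counts.append(n - sum(counts))  # remainder goes to the last stratum
    return counts


# Scenario (iii): equal biases, different noise levels (hypothetical values).
rng = random.Random(0)
weights = [0.5, 0.5]
mus = [0.0, 0.0]
sigmas = [0.1, 1.0]
n, N = 200, 10_000

n_per_stratum = allocate(n, weights, sigmas)
strata = [
    simulate_stratum(rng, n_k, int(N * w), mu, sigma)
    for n_k, w, mu, sigma in zip(n_per_stratum, weights, mus, sigmas)
]
estimate = strat_ppi_mean(strata, weights)
print("labels per stratum:", n_per_stratum)
print("stratified PPI estimate of the mean:", estimate)
```

Passing equal `scales` to `allocate` gives the data-proportional split (ρk = wk), so the same script can contrast the two allocation strategies by changing one argument.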