Linguistic Calibration of Long-Form Generations

Authors: Neil Band, Xuechen Li, Tengyu Ma, Tatsunori Hashimoto

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section empirically validates our training and evaluation framework for linguistic calibration.
Researcher Affiliation | Academia | Neil Band, Xuechen Li, Tengyu Ma, Tatsunori Hashimoto (Department of Computer Science, Stanford University).
Pseudocode | Yes | Algorithm 1: Decision-Based RL with a Surrogate Reader (see the loop sketch after the table).
Open Source Code | Yes | We release code at github.com/tatsu-lab/linguistic_calibration.
Open Datasets | Yes | Our training framework and all baselines use examples from the TriviaQA (Joshi et al., 2017) unfiltered.nocontext subset from Hugging Face Datasets (Lhoest et al., 2021). These examples are randomly assigned to the following splits: SFT (10000 examples): used for summary distillation and the other SFT baselines (Factuality SFT, Claude Distill).
Dataset Splits | Yes | These examples are randomly assigned to the following splits: SFT (10000 examples): used for summary distillation and the other SFT baselines (Factuality SFT, Claude Distill). Prompt Validation (1000 examples): used for all ICL-based baselines and to construct ICL examples for the simulated reader, which uses an API-based LLM. Reward Model (20000 examples): used to train the surrogate reader for LC and the binary reward model for the Factuality RL baseline. PPO (40000 examples): used for PPO with the LC RL and Factuality RL methods. PPO Validation (1000 examples): during PPO, we evaluate reward model rewards on this split and store checkpoints every 20 steps. Validation (1000 examples): used for tuning the evaluation temperature and model selection for RL methods (described below). (See the split sketch after the table.)
Hardware Specification | Yes | We use a standard implementation of PPO from Dubois et al. (2023) and train with 8 80GB A100 GPUs using FlashAttention-2 (Dao et al., 2022; Dao, 2023) and PyTorch FSDP (Zhao et al., 2023).
Software Dependencies | Yes | We use a standard implementation of PPO from Dubois et al. (2023) and train with 8 80GB A100 GPUs using FlashAttention-2 (Dao et al., 2022; Dao, 2023) and PyTorch FSDP (Zhao et al., 2023). ... We use the paged_adamw_8bit optimizer (Dettmers et al., 2022) due to computational constraints. (See the FSDP and optimizer sketch after the table.)
Experiment Setup | Yes | We use a larger step batch size (512) with one optimization epoch per step for better training stability. We shorten query_len to 128 tokens, because our PPO inputs were essentially all under this length. We use a slightly lower temperature during the rollout phase (0.7 instead of 1.0). ... We train for 1500 PPO steps. We tune the KL penalty and learning rate of both PPO methods across a wide range, ultimately finding that a kl_coef of 0.1 and a learning rate of 1e-5 work best for both methods. For the LC RL objective, we find that λ = 5, C = 5 works well to enforce normalization of downstream forecasts and prevent reward hacking. In the log-loss term, we clip the probability of the ground-truth answer at ε = 1e-4 for numerical stability. (See the hyperparameter sketch after the table.)
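
Algorithm 1 itself is not reproduced in this section. As orientation only, the skeleton below sketches how a decision-based RL loop with a surrogate reader fits together: the policy writes a long-form paragraph, the surrogate reader forecasts the probability of the ground-truth answer from that paragraph, and the clipped log-loss of that forecast is used as the PPO reward. All names here (policy.generate, surrogate_reader.forecast, ppo_trainer.step) are illustrative placeholders rather than the released API, and the normalization penalty on forecasts (λ, C) is omitted.

```python
# Rough skeleton of a decision-based RL loop with a surrogate reader.
# NOT the paper's Algorithm 1 verbatim; all object interfaces are placeholders.
import math


def lc_log_loss_reward(prob_ground_truth: float, eps: float = 1e-4) -> float:
    # Log probability the surrogate reader assigns to the ground-truth answer,
    # clipped at eps for numerical stability (the paper's stated clipping).
    return math.log(max(prob_ground_truth, eps))


def train_step(policy, surrogate_reader, ppo_trainer, batch):
    rewards = []
    for example in batch:
        # 1. Policy writes a long-form paragraph for the open-ended query.
        paragraph = policy.generate(example["query"])
        # 2. Surrogate reader answers the related question using only the
        #    paragraph, yielding a probability forecast for the ground truth.
        prob = surrogate_reader.forecast(
            paragraph, example["question"], example["ground_truth_answer"]
        )
        # 3. Reward is the clipped log-loss of the reader's forecast.
        #    (The full LC RL objective also adds a forecast-normalization
        #    penalty governed by λ and C, omitted here.)
        rewards.append(lc_log_loss_reward(prob))
    # 4. Standard PPO update on (query, paragraph, reward) tuples.
    ppo_trainer.step(batch, rewards)
```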
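The split assignment quoted in the Dataset Splits row can be made concrete with Hugging Face Datasets. In the sketch below, only the split names and sizes come from the paper; the dataset and config identifiers match the quoted unfiltered.nocontext subset, while the random seed and the shuffle-then-slice procedure are assumptions for illustration.

```python
from datasets import load_dataset

# Split sizes as reported in the paper; field names are illustrative.
SPLIT_SIZES = {
    "sft": 10_000,               # summary distillation and other SFT baselines
    "prompt_validation": 1_000,  # ICL baselines and simulated-reader ICL examples
    "reward_model": 20_000,      # surrogate reader (LC) and binary RM (Factuality RL)
    "ppo": 40_000,               # PPO for LC RL and Factuality RL
    "ppo_validation": 1_000,     # reward tracking / checkpointing during PPO
    "validation": 1_000,         # evaluation temperature tuning and model selection
}

raw = load_dataset("trivia_qa", "unfiltered.nocontext", split="train")
shuffled = raw.shuffle(seed=42)  # seed is an assumption, not from the paper

splits, start = {}, 0
for name, size in SPLIT_SIZES.items():
    splits[name] = shuffled.select(range(start, start + size))
    start += size
```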
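The reported software stack (PyTorch FSDP, FlashAttention-2, paged 8-bit AdamW) can be illustrated with a minimal setup sketch. The base checkpoint, dtype, and bare FSDP wrap below are assumptions rather than the released training configuration, and distributed initialization (e.g. via torchrun) is assumed to have happened already.

```python
import torch
import bitsandbytes as bnb
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Assumed base checkpoint for illustration only; not taken from the released code.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
)

# Shard parameters, gradients, and optimizer state across the 8 A100s
# (requires an initialized process group, e.g. launched with torchrun).
fsdp_model = FSDP(model)

# Paged 8-bit AdamW (Dettmers et al., 2022) to reduce optimizer memory,
# with the learning rate reported for PPO.
optimizer = bnb.optim.PagedAdamW8bit(fsdp_model.parameters(), lr=1e-5)
```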
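Finally, the PPO settings quoted in the Experiment Setup row are collected into a single configuration sketch. Field names are illustrative; the values are the ones stated above. The exact functional form of the normalization penalty governed by λ and C is defined in the paper and not reproduced here, so only the stated clipping threshold for the log-loss term appears as a plain field.

```python
from dataclasses import dataclass


@dataclass
class PPOSetup:
    # Values as stated in the paper; field names are illustrative, not the
    # released configuration keys.
    step_batch_size: int = 512        # one optimization epoch per step
    query_len: int = 128              # tokens; PPO inputs were essentially all shorter
    rollout_temperature: float = 0.7  # lowered from 1.0
    total_ppo_steps: int = 1500
    kl_coef: float = 0.1
    learning_rate: float = 1e-5
    # LC RL objective: normalization-penalty coefficients (penalty form defined
    # in the paper, not reproduced here).
    lam: float = 5.0
    C: float = 5.0
    log_loss_clip: float = 1e-4       # clip on the ground-truth answer probability
```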