Fast yet Safe: Early-Exiting with Risk Control

Authors: Metod Jazbec, Alexander Timans, Tin Hadži Veljković, Kaspar Sakmann, Dan Zhang, Christian Andersson Naesseth, Eric Nalisnick

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.
Researcher Affiliation Collaboration 1UvA-Bosch Delta Lab, University of Amsterdam; 2Bosch Center for AI, Robert Bosch GmbH; 3Johns Hopkins University
Pseudocode Yes Algorithm 1: Risk control for EENNs via UCB (Prop. 2)
input: EENN p̂_λ, Dcal, ε, δ, loss function ℓ, grid step
output: λ̂_UCB
grid = np.arange(1, 0, -step)
# Construct the UCB (Eq. 11)
UCB = np.ones(len(grid))
for i, λ in grid do
    L = max(ℓ(p̂_λ, Dcal) − ℓ(p̂_L, Dcal), 0)  # (n,)
    UCB[i] = WSR(L, δ)  # Algorithm 2
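The calibration loop above can be sketched as a short runnable program. This is a minimal illustration, not the authors' implementation: `per_sample_loss` is a hypothetical callable standing in for ℓ evaluated on Dcal, and a simple Hoeffding bound is used in place of the paper's WSR bound (their Algorithm 2).

```python
import numpy as np

def hoeffding_ucb(losses, delta):
    """(1 - delta) upper confidence bound on the mean of [0, 1]-bounded losses.
    Stand-in for the WSR bound used in the paper."""
    n = len(losses)
    return float(np.mean(losses) + np.sqrt(np.log(1.0 / delta) / (2.0 * n)))

def calibrate_lambda(per_sample_loss, grid, epsilon, delta):
    """Scan exit thresholds from conservative (1.0) downward and keep the
    smallest lambda whose risk UCB still stays at or below epsilon."""
    lam_hat = 1.0  # fall back to the full network (never exit early)
    for lam in grid:
        L = per_sample_loss(lam)  # shape (n,), per-sample losses in [0, 1]
        if hoeffding_ucb(L, delta) <= epsilon:
            lam_hat = float(lam)
        else:
            break  # risk bound exceeded; stop lowering the threshold
    return lam_hat

# Toy example: risk grows linearly once the threshold drops below 0.5.
grid = np.arange(1.0, 0.0, -0.05)
loss_fn = lambda lam: np.full(1000, max(0.0, 0.5 - lam))
lam_star = calibrate_lambda(loss_fn, grid, epsilon=0.1, delta=0.1)
print(lam_star)
```

Scanning from λ = 1 downward exploits the fact that earlier exits (smaller λ) can only increase risk relative to the full network, so the first violation of the bound ends the search.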
Open Source Code Yes Our code is publicly available at https://github.com/metodj/RC-EENN.
Open Datasets Yes Observed samples from P are split into disjoint train, calibration and test sets, denoted Dtrain, Dcal, and Dtest. We focus on the ImageNet dataset [19]. We evaluate our approaches on Cityscapes validation data. We replicate the main experiments from Schuster et al. [65] (CALM), using their early-exit version of the T5 model [57] for text summarization on CNN/DM [33] and question answering on SQuAD [58]. Our results on the CelebA dataset [49].
Dataset Splits Yes Observed samples from P are split into disjoint train, calibration and test sets, denoted Dtrain, Dcal, and Dtest. We evaluate our approaches on Cityscapes validation data (80% Dcal, 20% Dtest); in addition, we finetune and evaluate ADP-C on a subset of the GTA5 dataset [59] in D.2. Similarly, we randomly select a subset of 1000 images from the GTA5 validation set and evaluate our finetuned model using 80% Dcal (i.e., 800 images) and 20% Dtest (i.e., 200 images).
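The 80%/20% calibration/test partition described above (e.g. 800/200 of the 1000 GTA5 validation images) can be sketched as follows; the fixed seed and index-based splitting are illustrative assumptions, since the report does not state how the random selection was seeded.

```python
import numpy as np

def cal_test_split(n, cal_frac=0.8, seed=0):
    """Randomly partition n sample indices into disjoint calibration and
    test index arrays, e.g. 800/200 for n=1000 and cal_frac=0.8."""
    rng = np.random.default_rng(seed)  # seed is an assumption, not from the paper
    idx = rng.permutation(n)
    n_cal = int(round(cal_frac * n))
    return idx[:n_cal], idx[n_cal:]

cal_idx, test_idx = cal_test_split(1000)
```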
Hardware Specification Yes All our experiments can be performed and replicated on a single A100 GPU with experiment runtimes of <1 day.
Software Dependencies No The paper mentions its code is publicly available, but does not explicitly list specific software dependencies with version numbers within the text.
Experiment Setup Yes For risk control with high probability, we set δ = 0.1 (i.e., 90% probability). For all models, we either work with the publicly available pretrained checkpoints or train the models ourselves, closely following the original implementation details. For this purpose, we employ the original training script and training parameters (e.g., learning rate, batch size, etc.).