Fast yet Safe: Early-Exiting with Risk Control
Authors: Metod Jazbec, Alexander Timans, Tin Hadži Veljković, Kaspar Sakmann, Dan Zhang, Christian Andersson Naesseth, Eric Nalisnick
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals. |
| Researcher Affiliation | Collaboration | 1UvA-Bosch Delta Lab, University of Amsterdam; 2Bosch Center for AI, Robert Bosch GmbH; 3Johns Hopkins University |
| Pseudocode | Yes | Algorithm 1: Risk control for EENNs via UCB (Prop. 2). Input: EENN p̂λ, Dcal, ϵ, δ, loss function ℓ, grid step. Output: λ̂UCB. `grid = np.arange(1, 0, -step)`; # Construct the UCB (Eq. 11): `UCB = np.ones(len(grid))`; for i, λ in grid do: `L = max(ℓ(p̂λ, Dcal) − ℓ(p̂L, Dcal), 0)` # (n,); `UCB[i] = WSR(L, δ)` # Algorithm 2 |
| Open Source Code | Yes | Our code is publicly available at https://github.com/metodj/RC-EENN. |
| Open Datasets | Yes | Observed samples from P are split into disjoint train, calibration and test sets, denoted Dtrain, Dcal, and Dtest. We focus on the ImageNet dataset [19]. We evaluate our approaches on Cityscapes validation data. We replicate the main experiments from Schuster et al. [65] (CALM), using their early-exit version of the T5 model [57] for text summarization on CNN/DM [33] and question answering on SQuAD [58]. Our results on the CelebA dataset [49]. |
| Dataset Splits | Yes | Observed samples from P are split into disjoint train, calibration and test sets, denoted Dtrain, Dcal, and Dtest. We evaluate our approaches on Cityscapes validation data (80% Dcal, 20% Dtest); in addition, we finetune and evaluate ADP-C on a subset of the GTA5 dataset [59] in D.2. Similarly, we randomly select a subset of 1000 images from the GTA5 validation set and evaluate our finetuned model using 80% Dcal (i.e., 800 images) and 20% Dtest (i.e., 200 images). |
| Hardware Specification | Yes | All our experiments can be performed and replicated on a single A100 GPU with experiment runtimes of <1 day. |
| Software Dependencies | No | The paper mentions its code is publicly available, but does not explicitly list specific software dependencies with version numbers within the text. |
| Experiment Setup | Yes | For risk control with high probability, we set δ = 0.1 (i.e., 90% probability). For all models, we either work with the publicly available pretrained checkpoints or train the models ourselves, closely following the original implementation details. For this purpose, we employ the original training script and training parameters (e.g., learning rate, batch size, etc.). |
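The calibration procedure quoted in the Pseudocode row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (see their repository for that): `hoeffding_ucb` stands in for the paper's WSR bound with a simpler Hoeffding-style upper confidence bound, and `loss_fn`, `calibrate_threshold`, and the synthetic data below are hypothetical names introduced here for the example. The idea matches the quoted algorithm: scan exit thresholds λ from safest to most aggressive and keep the last one whose risk UCB stays below the user-specified level ϵ.

```python
import numpy as np

def hoeffding_ucb(losses, delta):
    # (1 - delta) upper confidence bound on the mean of losses in [0, 1],
    # via Hoeffding's inequality; a stand-in for the paper's WSR bound.
    n = len(losses)
    return losses.mean() + np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def calibrate_threshold(loss_fn, cal_data, grid, epsilon, delta):
    # Fixed-sequence scan over a grid sorted from safest (high lambda,
    # late exits) to most aggressive (low lambda, early exits). Stop at
    # the first lambda whose risk UCB exceeds epsilon and return the
    # last one that was still controlled.
    best = None
    for lam in grid:
        L = np.asarray(loss_fn(lam, cal_data))   # per-sample losses, shape (n,)
        if hoeffding_ucb(L, delta) <= epsilon:
            best = lam
        else:
            break
    return best

# Toy usage: a synthetic loss that grows as the threshold is lowered.
grid = np.round(np.arange(1.0, 0.0, -0.1), 1)
cal_data = np.full(1000, 0.5)                    # fake per-sample scores
loss_fn = lambda lam, d: (1.0 - lam) * d         # losses bounded in [0, 1]
lam_hat = calibrate_threshold(loss_fn, cal_data, grid, epsilon=0.3, delta=0.1)
```

With these synthetic numbers the scan returns λ = 0.5: the mean loss at λ = 0.5 is 0.25 and the Hoeffding slack for n = 1000, δ = 0.1 is about 0.034, so the UCB of ~0.284 is the last one below ϵ = 0.3.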