I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token

Authors: Roi Cohen, Konstantin Dobler, Eden Biran, Gerard de Melo

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed method across multiple model architectures and factual downstream tasks. We find that models trained with our method are able to express uncertainty in places where they would previously make mistakes while suffering only a small loss of encoded knowledge. We further perform extensive ablation studies of multiple variations of our approach and provide a detailed analysis of the precision-recall tradeoff of our method.
Researcher Affiliation | Academia | Roi Cohen (HPI / University of Potsdam) Roi.Cohen@hpi.de; Konstantin Dobler (HPI / University of Potsdam) Konstantin.Dobler@hpi.de; Eden Biran (Tel Aviv University) edenbiran@mail.tau.ac.il; Gerard de Melo (HPI / University of Potsdam) Gerard.DeMelo@hpi.de
Pseudocode | No | The paper describes its objective function and training process mathematically and with descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code and IDK-tuned model checkpoints at https://github.com/roi-hpi/IDK-token-tuning.
Open Datasets | Yes | For IDK-tuning Mistral-7B-v0.1, we train on data randomly sampled from The Pile [Gao et al., 2020] with a context length of 4,096. We consider the following datasets: LAMA [Petroni et al., 2019], TriviaQA [Joshi et al., 2017], and PopQA [Mallen et al., 2022]. To evaluate multiple-choice question answering, we use EleutherAI's lm-evaluation-harness [Gao et al., 2023]. Specifically, we use ARC [Clark et al., 2018], HellaSwag [Zellers et al., 2019], MMLU [Hendrycks et al., 2020], TruthfulQA [Lin et al., 2022a], WinoGrande [Sakaguchi et al., 2021], and GSM8k [Cobbe et al., 2021].
Dataset Splits | No | The paper mentions using a 'development set' for hyperparameter tuning ('To create a strong baseline, we search for the best threshold via hyperparameter tuning on the development set.') but does not specify a general train/validation/test split for the datasets used for model training or evaluation.
Hardware Specification | Yes | For IDK-tuning of Mistral-7B-v0.1, we use Nvidia H100 or A100 GPUs depending on availability. For IDK-tuning of pythia-70m–2.8B, we use 1-4 Nvidia A6000 GPUs. For IDK-tuning of bert-base-cased, we use a single Nvidia A100 GPU.
Software Dependencies | No | The paper mentions using 'AdamW betas', 'bfloat16 and float16 mixed-precision training', and 'lm-evaluation-harness', but does not specify version numbers for any of these software components or frameworks.
Experiment Setup | Yes | We use a maximum learning rate of 4e-5 with a linear warmup for 10% of the training steps and a cosine decay down to 2e-6. We use a batch size of 256, weight decay of 0.05, gradient clipping of 1.0 and AdamW betas (0.9, 0.95). We train for 1,024 optimizer steps resulting in a total of 1B training tokens. For the pythia-70m–2.8B models, we use the same hyperparameters but reduce the context length to 2,048 to match the models' positional embeddings.
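
For orientation, the technique named in the title adds a dedicated [IDK] token to the model's vocabulary and continues pretraining with an objective that lets the model place probability mass on [IDK] where it would otherwise answer incorrectly. The snippet below is only a minimal illustrative sketch of that idea, not the authors' exact objective (which is defined in the paper): it assumes a Hugging Face causal LM and a hypothetical mixing weight `idk_weight` that moves part of the target mass to [IDK] whenever the model's current argmax prediction is wrong.

```python
# Minimal, illustrative sketch of IDK-tuning (not the authors' exact objective).
# Assumptions: a Hugging Face causal LM, a newly added [IDK] special token, and
# a hypothetical soft target that shifts `idk_weight` of the mass to [IDK]
# whenever the model's argmax next-token prediction is wrong.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# 1) Add the [IDK] token and resize the embedding matrix accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": ["[IDK]"]})
model.resize_token_embeddings(len(tokenizer))
idk_id = tokenizer.convert_tokens_to_ids("[IDK]")


def idk_loss(logits, labels, idk_weight=0.5):
    """Cross-entropy against a soft target: the gold token keeps most of the
    mass, but if the model's current argmax is wrong, `idk_weight` of the
    target mass is moved to [IDK] (hypothetical mixing rule; padding/ignore
    indices are not handled in this sketch)."""
    vocab_size = logits.size(-1)
    logits = logits[:, :-1, :].reshape(-1, vocab_size)  # predict token t+1
    labels = labels[:, 1:].reshape(-1)

    with torch.no_grad():
        wrong = (logits.argmax(dim=-1) != labels).float().unsqueeze(-1)

    target = F.one_hot(labels, vocab_size).float()
    idk = F.one_hot(torch.full_like(labels, idk_id), vocab_size).float()
    soft_target = (1 - idk_weight * wrong) * target + idk_weight * wrong * idk

    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()
```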
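The Open Datasets row reports that multiple-choice evaluation uses EleutherAI's lm-evaluation-harness. A hedged usage sketch, assuming a recent 0.4.x release (the `simple_evaluate` entry point and the exact task names may differ across versions):

```python
# Sketch of the multiple-choice evaluation described in the Open Datasets row,
# using EleutherAI's lm-evaluation-harness (assumes a 0.4.x release).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2",
           "winogrande", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```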
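The Experiment Setup row fully specifies the optimizer and schedule, so it can be translated into a short PyTorch sketch. The warmup-then-cosine schedule below is one way to realize "linear warmup for 10% of the training steps and a cosine decay down to 2e-6"; `model` is a stand-in for the IDK-tuned LM.

```python
# Optimizer and learning-rate schedule matching the Experiment Setup row
# (sketch only; `model` stands in for the IDK-tuned language model).
import math
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual LM

max_lr, min_lr = 4e-5, 2e-6
total_steps = 1024
warmup_steps = int(0.10 * total_steps)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.05
)

def lr_lambda(step):
    # Linear warmup for the first 10% of steps, then cosine decay to min_lr.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return (min_lr + (max_lr - min_lr) * cosine) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per training step (batch size 256 in the paper):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping of 1.0
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```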