Instructive Decoding: Instruction-Tuned Large Language Models are Self-Refiner from Noisy Instructions

Authors: Taehyeon Kim, Joonkee Kim, Gihun Lee, Se-Young Yun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments across a spectrum of such noisy instructions... Our approach achieves considerable performance gains across various instruction-tuned models and tasks... Experiments on unseen task generalization with SUPNATINST (Wang et al., 2022c) and UNNATINST (Honovich et al., 2022) held-out datasets show that instruction-tuned models enhanced by ID consistently outperform baseline models across various setups.
Researcher Affiliation | Academia | Taehyeon Kim, Joonkee Kim, Gihun Lee, and Se-Young Yun, KAIST AI
Pseudocode | Yes | Algorithm 1: Instructive Decoding (a Python sketch follows the table)
INPUT: Language model M_θ, base instruction sequence I, noisy instruction sequence Ĩ, initial prompt sequence x, target sequence length T, smoothing coefficient ε.
1: Initialize t ← 1
2: while t < T do
3:   z_t, z̃_t ← M_θ(y_t | I, x, y_<t), M_θ(y_t | Ĩ, x, y_<t)
4:   y_t = argmax(softmax[z_t - ε · z̃_t])
5:   t ← t + 1
6: end while
Open Source Code | No | The paper points to publicly available checkpoints for pre-trained models (e.g., Tk-Instruct, T0-3B, Alpaca-7B) and states 'Tk-XL and Tk-XXL come from publicly available checkpoints', but it provides no link to, or explicit statement about, source code for its own method.
Open Datasets | Yes | For our experiments, two datasets are utilized: SUPNATINST (Wang et al., 2022c) and UNNATINST (Honovich et al., 2022).
Dataset Splits | No | These models are trained across 757 English tasks from the SUPNATINST training split over 2 epochs, with each task comprising 100 samples. Our evaluation primarily involves three sizes of Tk-Instruct models... Due to the absence of an official validation task, training is conducted without splits and the last checkpoint is used for experiments.
Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., CPU, GPU models, or cloud computing instances).
Software Dependencies | No | The paper mentions software like T5-LM, LLaMA, and Natural Language Toolkit (NLTK), but it does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch/TensorFlow).
Experiment Setup | Yes | The models are trained with a batch size of 16, for 2 epochs, using a learning rate of 5e-5. For both training and evaluation, the combined maximum length of demonstrations and contexts is set to 1,024, while the maximum generation length is limited to 128. ... Although our typical choice for ε was 0.3, we evaluated ID-null across a range of ε values, spanning from -0.5 to 0.5 at 0.01 intervals. ... The parameter α, which constrains the model's confidence, is set to 0.1 as in Li et al. (2022).
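
The contrastive step in Algorithm 1 is simple enough to sketch as a greedy decoding loop. The following is a minimal sketch, assuming a Hugging Face encoder-decoder checkpoint (the checkpoint name, prompt handling, and loop structure are illustrative assumptions, not the authors' released implementation, since none is linked):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: any instruction-tuned encoder-decoder checkpoint; the name below
# is illustrative (Tk-Instruct models are T5-based).
MODEL = "allenai/tk-instruct-base-def-pos"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def instructive_decode(base_prompt: str, noisy_prompt: str,
                       epsilon: float = 0.3, max_len: int = 128) -> str:
    """Greedy decoding on the contrasted logits z_t - epsilon * z~_t (Algorithm 1)."""
    enc = tokenizer(base_prompt, return_tensors="pt")
    enc_noisy = tokenizer(noisy_prompt, return_tensors="pt")
    ys = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(max_len):
        z = model(**enc, decoder_input_ids=ys).logits[:, -1, :]              # z_t (base instruction)
        z_noisy = model(**enc_noisy, decoder_input_ids=ys).logits[:, -1, :]  # z~_t (noisy instruction)
        # argmax(softmax(x)) == argmax(x), so the softmax in step 4 can be skipped.
        next_id = torch.argmax(z - epsilon * z_noisy, dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ys[0], skip_special_tokens=True)
```

Under this reading, the ID-null variant evaluated in the experiment setup corresponds to passing the prompt without any instruction as noisy_prompt.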
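
For the training configuration in the last row, a hedged sketch of equivalent Hugging Face Seq2SeqTrainingArguments; the output directory and save strategy are assumptions, as the paper does not state them:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the reported setup: batch size 16, 2 epochs, learning rate 5e-5,
# generation capped at 128 tokens. The 1,024-token cap on demonstrations plus
# context would be applied at tokenization time, e.g.
# tokenizer(..., truncation=True, max_length=1024).
training_args = Seq2SeqTrainingArguments(
    output_dir="tk-supnatinst",        # assumption: not specified in the paper
    per_device_train_batch_size=16,
    num_train_epochs=2,
    learning_rate=5e-5,
    predict_with_generate=True,
    generation_max_length=128,
    save_strategy="epoch",             # the last checkpoint is used, per the paper
)
```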