Instructive Decoding: Instruction-Tuned Large Language Models are Self-Refiner from Noisy Instructions
Authors: Taehyeon Kim, Joonkee Kim, Gihun Lee, Se-Young Yun
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments across a spectrum of such noisy instructions... Our approach achieves considerable performance gains across various instruction-tuned models and tasks... Experiments on unseen task generalization with SUPNATINST (Wang et al., 2022c) and UNNATINST (Honovich et al., 2022) heldout datasets show that instruction-tuned models enhanced by ID consistently outperform baseline models across various setups. |
| Researcher Affiliation | Academia | Taehyeon Kim, Joonkee Kim, Gihun Lee, and Se-Young Yun, KAIST AI |
| Pseudocode | Yes | Algorithm 1 (Instructive Decoding). Input: language model M_θ, base instruction sequence I, noisy instruction sequence Ĩ, initial prompt sequence x, target sequence length T, smoothing coefficient ϵ. 1: Initialize t ← 1; 2: while t < T do; 3: z_t, z̃_t ← M_θ(y_t | I, x, y_<t), M_θ(y_t | Ĩ, x, y_<t); 4: y_t = argmax(softmax[z_t − ϵ·z̃_t]); 5: set t ← t + 1; 6: end while. (A runnable sketch of this loop is given below the table.) |
| Open Source Code | No | The paper mentions publicly available checkpoints for pre-trained models (e.g., Tk-Instruct, T0-3B, Alpaca-7B) and states 'Tk-XL and Tk-XXL come from publicly available checkpoints', but it does not provide a link or explicit statement for its own methodology's source code. |
| Open Datasets | Yes | For our experiments, two datasets are utilized: SUPNATINST (Wang et al., 2022c) and UNNATINST (Honovich et al., 2022). |
| Dataset Splits | No | These models are trained across 757 English tasks from the SUPNATINST training split over 2 epochs, with each task comprising 100 samples. Our evaluation primarily involves three sizes of Tk-Instruct models... Due to the absence of an official validation task, training is conducted without splits and the last checkpoint is used for experiments. |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., CPU, GPU models, or cloud computing instances). |
| Software Dependencies | No | The paper mentions software like T5-LM, LLaMA, and Natural Language Toolkit (NLTK), but it does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch/TensorFlow). |
| Experiment Setup | Yes | The models are trained with a batch size of 16, for 2 epochs, using a learning rate of 5e-5. For both training and evaluation, the combined maximum length of demonstrations and contexts is set to 1,024, while the maximum generation length is limited to 128. ... Although our typical choice for ϵ was 0.3, we evaluated ID-null across a range of ϵ values, spanning from -0.5 to 0.5 at 0.01 intervals. ... The parameter α, which constrains the model's confidence, is set to 0.1 as in Li et al. (2022). (See the α-constraint sketch below the table.) |
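
To make the quoted pseudocode concrete, here is a minimal PyTorch sketch of the decoding loop in Algorithm 1. The `next_token_logits` helper is an assumed wrapper around the instruction-tuned model M_θ (it is not named in the paper), and the default ϵ = 0.3 follows the Experiment Setup row.

```python
import torch
import torch.nn.functional as F

def instructive_decoding(next_token_logits, base_ids, noisy_ids, prompt_ids,
                         max_new_tokens=128, epsilon=0.3, eos_id=None):
    """Greedy sketch of Algorithm 1 (Instructive Decoding).

    `next_token_logits(instruction_ids, prompt_ids, generated)` is an assumed
    helper that wraps the instruction-tuned model M_theta and returns a 1-D
    tensor of next-token logits, conditioned on an instruction, the input x,
    and the tokens generated so far (y_<t).
    """
    generated = []
    for _ in range(max_new_tokens):
        # Line 3: logits under the base instruction I and the noisy instruction I~.
        z_base = next_token_logits(base_ids, prompt_ids, generated)
        z_noisy = next_token_logits(noisy_ids, prompt_ids, generated)
        # Line 4: y_t = argmax softmax(z_t - epsilon * z~_t).
        probs = F.softmax(z_base - epsilon * z_noisy, dim=-1)
        next_token = int(torch.argmax(probs))
        generated.append(next_token)
        if eos_id is not None and next_token == eos_id:
            break
    return generated
```

The maximum generation length of 128 quoted in the Experiment Setup row maps to `max_new_tokens`, and ID-null would amount to passing a null (empty or placeholder) instruction as `noisy_ids`.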
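
The same row cites the confidence constraint α = 0.1 from Li et al. (2022), used for the contrastive-decoding comparison. Below is a minimal sketch of how such an adaptive plausibility constraint is commonly implemented; the function name and the choice to mask with −∞ are assumptions, not details quoted from the paper.

```python
import torch

def adaptive_plausibility_mask(base_logits, alpha=0.1):
    """Sketch of the alpha-constraint of Li et al. (2022): keep only tokens
    whose probability under the base distribution is at least alpha times
    the most likely token's probability; mask the rest to -inf so the
    contrastive score can never select them. (Assumed implementation.)"""
    probs = torch.softmax(base_logits, dim=-1)
    keep = probs >= alpha * probs.max()
    return torch.where(keep, base_logits,
                       torch.full_like(base_logits, float("-inf")))
```

If applied, the mask would precede the ϵ-contrast in the loop above; the ϵ sweep from -0.5 to 0.5 at 0.01 intervals mentioned in the table can then be run by varying the `epsilon` argument.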