Instructive Decoding: Instruction-Tuned Large Language Models are Self-Refiner from Noisy Instructions

Authors: Taehyeon Kim, Joonkee Kim, Gihun Lee, Se-Young Yun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments across a spectrum of such noisy instructions... Our approach achieves considerable performance gains across various instruction-tuned models and tasks... Experiments on unseen task generalization with SUPNATINST (Wang et al., 2022c) and UNNATINST (Honovich et al., 2022) held-out datasets show that instruction-tuned models enhanced by ID consistently outperform baseline models across various setups.
Researcher Affiliation | Academia | Taehyeon Kim, Joonkee Kim, Gihun Lee, and Se-Young Yun, KAIST AI
Pseudocode | Yes | Algorithm 1: Instructive Decoding (a Python sketch follows the table)
INPUT: Language model M_θ, base instruction sequence I, noisy instruction sequence Ĩ, initial prompt sequence x, target sequence length T, smoothing coefficient ε.
1: Initialize t ← 1
2: while t < T do
3:   z_t, z̃_t ← M_θ(y_t | I, x, y_<t), M_θ(y_t | Ĩ, x, y_<t)
4:   y_t = argmax(softmax[z_t - ε · z̃_t])
5:   t ← t + 1
6: end while
Open Source Code | No | The paper points to publicly available checkpoints for pre-trained models (e.g., Tk-Instruct, T0-3B, Alpaca-7B) and states 'Tk-XL and Tk-XXL come from publicly available checkpoints', but it provides no link to, or explicit statement about, source code for its own method.
Open Datasets | Yes | For our experiments, two datasets are utilized: SUPNATINST (Wang et al., 2022c) and UNNATINST (Honovich et al., 2022).
Dataset Splits | No | These models are trained across 757 English tasks from the SUPNATINST training split over 2 epochs, with each task comprising 100 samples. Our evaluation primarily involves three sizes of Tk-Instruct models... Due to the absence of an official validation task, training is conducted without splits and the last checkpoint is used for experiments.
Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., CPU, GPU models, or cloud computing instances).
Software Dependencies | No | The paper mentions software like T5-LM, LLaMA, and Natural Language Toolkit (NLTK), but it does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch/TensorFlow).
Experiment Setup | Yes | The models are trained with a batch size of 16, for 2 epochs, using a learning rate of 5e-5. For both training and evaluation, the combined maximum length of demonstrations and contexts is set to 1,024, while the maximum generation length is limited to 128. ... Although our typical choice for ε was 0.3, we evaluated ID-null across a range of ε values, spanning from -0.5 to 0.5 at 0.01 intervals. ... The parameter α, which constrains the model's confidence, is set to 0.1 as in Li et al. (2022).
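
The contrastive step in Algorithm 1 is simple enough to sketch as a greedy decoding loop. The following is a minimal sketch, assuming a Hugging Face encoder-decoder checkpoint (the checkpoint name, prompt handling, and loop structure are illustrative assumptions, not the authors' released implementation, since none is linked):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: any instruction-tuned encoder-decoder checkpoint; the name below
# is illustrative (Tk-Instruct models are T5-based).
MODEL = "allenai/tk-instruct-base-def-pos"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def instructive_decode(base_prompt: str, noisy_prompt: str,
                       epsilon: float = 0.3, max_len: int = 128) -> str:
    """Greedy decoding on the contrasted logits z_t - epsilon * z~_t (Algorithm 1)."""
    enc = tokenizer(base_prompt, return_tensors="pt")
    enc_noisy = tokenizer(noisy_prompt, return_tensors="pt")
    ys = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(max_len):
        z = model(**enc, decoder_input_ids=ys).logits[:, -1, :]              # z_t (base instruction)
        z_noisy = model(**enc_noisy, decoder_input_ids=ys).logits[:, -1, :]  # z~_t (noisy instruction)
        # argmax(softmax(x)) == argmax(x), so the softmax in step 4 can be skipped.
        next_id = torch.argmax(z - epsilon * z_noisy, dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ys[0], skip_special_tokens=True)
```

Under this reading, the ID-null variant evaluated in the experiment setup corresponds to passing the prompt without any instruction as noisy_prompt.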
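
For the training configuration in the last row, a hedged sketch of equivalent Hugging Face Seq2SeqTrainingArguments; the output directory and save strategy are assumptions, as the paper does not state them:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the reported setup: batch size 16, 2 epochs, learning rate 5e-5,
# generation capped at 128 tokens. The 1,024-token cap on demonstrations plus
# context would be applied at tokenization time, e.g.
# tokenizer(..., truncation=True, max_length=1024).
training_args = Seq2SeqTrainingArguments(
    output_dir="tk-supnatinst",        # assumption: not specified in the paper
    per_device_train_batch_size=16,
    num_train_epochs=2,
    learning_rate=5e-5,
    predict_with_generate=True,
    generation_max_length=128,
    save_strategy="epoch",             # the last checkpoint is used, per the paper
)
```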