Self-Alignment with Instruction Backtranslation

Authors: Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason E Weston, Mike Lewis

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. ... Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
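The abstract quoted above describes an iterative augment-and-curate loop. A minimal sketch of that idea follows; the helper functions, score scale, and threshold are hypothetical stand-ins, not the paper's implementation (which uses finetuned LLaMa models for both steps):

```python
def self_align(seed_pairs, web_texts, propose_fn, score_fn,
               iterations=2, threshold=4.5):
    """Toy sketch of instruction backtranslation: each round, a backward
    model proposes an instruction for every unlabelled text
    (self-augmentation), and only (instruction, output) pairs the current
    model scores highly are kept for finetuning (self-curation)."""
    training_set = list(seed_pairs)
    for _ in range(iterations):
        # Self-augmentation: propose an instruction for each web text.
        candidates = [(propose_fn(text), text) for text in web_texts]
        # Self-curation: keep only high-scoring pairs.
        curated = [pair for pair in candidates if score_fn(pair) >= threshold]
        training_set.extend(curated)
    return training_set


# Stand-in helpers; a real system would call language models here.
propose = lambda text: f"Summarize: {text[:20]}"
score = lambda pair: 5.0 if len(pair[1]) > 10 else 1.0

data = self_align([("seed instruction", "seed output")],
                  ["a long unlabelled web document", "short"],
                  propose, score, iterations=1)
```

In this toy run only the longer document survives curation, so `data` holds the seed pair plus one curated pair; a real pipeline would also deduplicate across iterations.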
Researcher Affiliation | Industry | Meta; {xianl,jase,mikelewis}@meta.com
Pseudocode | No | The paper includes a figure (Figure 1) illustrating the overview of the method, but it does not contain any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions the Alpaca Eval link (https://github.com/tatsu-lab/alpaca_eval), which is an evaluation tool, but it does not provide any link or explicit statement about releasing the source code for their own method (instruction backtranslation/Humpback).
Open Datasets | Yes | We use 3200 examples from the Open Assistant dataset (Köpf et al., 2023) as human-annotated seed data to train our models. ... We use the English portion of the Clueweb corpus as the source of unlabelled data (Overwijk et al., 2022).
Dataset Splits | No | The paper states it used a 'dev set' for evaluation ('We sample 256 prompts from them excluding those in the Alpaca Eval test set as a dev set.'), but it does not provide explicit training/validation/test splits of the datasets used for model training or hyperparameter tuning. It describes only the total sizes of the seed and augmented data.
Hardware Specification | No | The paper mentions 'LLaMA model with 7B, 33B and 65B parameters' but does not specify the actual hardware (e.g., GPU models, CPU types, or cloud instance types) used for finetuning or running experiments.
Software Dependencies | No | The paper mentions models like 'LLaMA', 'GPT-4', and 'text-davinci-003' but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in their implementation.
Experiment Setup | Yes | We use the same hyperparameters as existing supervised finetuning (SFT) methods (Zhou et al., 2023; Touvron et al., 2023a) for most models: learning rate 1e-5 which linearly decays to 9e-6 at the end of training, weight decay 0.1, batch size 32 (examples) and dropout 0.1. For finetuning with less than 3000 examples we use batch size 8 (more details in Table 18).
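The quoted learning-rate schedule (linear decay from 1e-5 to 9e-6 over training) can be written as a one-line interpolation; a sketch, assuming decay spans the full run:

```python
def linear_lr(step, total_steps, lr_start=1e-5, lr_end=9e-6):
    """Linearly interpolate the learning rate from lr_start at step 0
    down to lr_end at the final step, per the setup quoted above."""
    frac = step / total_steps
    return lr_start + (lr_end - lr_start) * frac

# Learning rate at the start, midpoint, and end of a 1000-step run.
lrs = [linear_lr(s, 1000) for s in (0, 500, 1000)]
```

Equivalent behavior is available off the shelf, e.g. PyTorch's `torch.optim.lr_scheduler.LinearLR`, though the paper does not say which framework was used.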