Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

Authors: Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned gpt2s to improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., zephyr-7b-beta and its untuned version) can improve the length-controlled win rates of both white-box and black-box large models against gpt-4-turbo (e.g., 34.4% → 37.9% for Llama-3-70B-Instruct and 16.0% → 20.1% for gpt-3.5-turbo-instruct), despite the small models' low win rates of ≈ 10.0%.
Researcher Affiliation | Academia | Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao — Shanghai Artificial Intelligence Laboratory. Core contribution / corresponding author contacts: asap.zzhou@gmail.com, yangchao@pjlab.org.cn
Pseudocode | Yes | Algorithm 1: Chunk-level Beam Search (CBS) — a hedged sketch of the procedure appears after the table.
Open Source Code | Yes | Code: https://github.com/ZHZisZZ/weak-to-strong-search
Open Datasets | Yes | For controlled-sentiment generation, we reuse the publicly available distilbert-imdb to define the gold reward model r_gold; distilbert-imdb is a classifier fine-tuned on the imdb dataset [53] to classify movie review sentiments. For summarization, we fit a reward model on the summarize_from_feedback dataset [2] as the gold reward model r_gold. (A hedged scoring sketch appears after the table.)
Dataset Splits | No | The paper mentions 'validation accuracies' and 'validation prompts' but does not specify explicit split percentages or absolute counts for the training, validation, and test sets, nor does it cite a standard split configuration.
Hardware Specification | Yes | Models are evaluated over 1000 test prompts on a single NVIDIA A100 GPU. Model inference takes place on a single NVIDIA A100 GPU for 7B/8B and black-box models, and on four NVIDIA A100 GPUs for 70B models.
Software Dependencies | No | The paper mentions models such as 'gpt2', 'Llama-2', and 'zephyr-7b-beta', the use of the 'Llama-2 tokenizer', and the 'standard DPO pipeline', but it does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | We use fixed hyperparameters across all tested models: temperature T = 0.7, top-k = 50, and top-p = 1.0 when sampling from the language models. For weak-to-strong search (CBS), we use W, K, L = 4, 4, 5 (W: beam width, K: successors per state, L: chunk length). For BoN, we use N = 16 for a fair computational comparison with weak-to-strong search (i.e., WK = N). For EFT, we report the best results among β ∈ {1/4, 1/2, 1, 2, 4}.
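
The following is a minimal sketch of weak-to-strong search as chunk-level beam search (Algorithm 1), using the hyperparameters quoted in the Experiment Setup row (W, K, L = 4, 4, 5; temperature 0.7, top-k 50, top-p 1.0). The model ids, helper names, and the exact scoring rule (log-probability difference between the small tuned and untuned models on each candidate continuation) are assumptions based on the paper's description, not the authors' released implementation at https://github.com/ZHZisZZ/weak-to-strong-search.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model choices for this sketch: a white-box large model to steer, plus a
# small tuned/untuned pair whose log-probability ratio acts as the implicit reward.
LARGE_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SMALL_TUNED_ID = "HuggingFaceH4/zephyr-7b-beta"
SMALL_UNTUNED_ID = "mistralai/Mistral-7B-v0.1"

W, K, L = 4, 4, 5  # beam width, successors per state, chunk length (in tokens)
SAMPLING = dict(do_sample=True, temperature=0.7, top_k=50, top_p=1.0)

large_tok = AutoTokenizer.from_pretrained(LARGE_ID)
large = AutoModelForCausalLM.from_pretrained(LARGE_ID, torch_dtype=torch.bfloat16, device_map="auto")
small_tok = AutoTokenizer.from_pretrained(SMALL_TUNED_ID)
tuned = AutoModelForCausalLM.from_pretrained(SMALL_TUNED_ID, torch_dtype=torch.bfloat16, device_map="auto")
untuned = AutoModelForCausalLM.from_pretrained(SMALL_UNTUNED_ID, torch_dtype=torch.bfloat16, device_map="auto")


@torch.no_grad()
def continuation_logprob(model, tokenizer, prompt, continuation):
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`
    (assumes tokenization splits cleanly at the prompt/continuation boundary)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]                      # next-token logits
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[0, prompt_len - 1:].sum().item()          # continuation tokens only


def small_score(prompt, continuation):
    """Proxy reward from the small models: log pi_tuned - log pi_untuned."""
    return (continuation_logprob(tuned, small_tok, prompt, continuation)
            - continuation_logprob(untuned, small_tok, prompt, continuation))


@torch.no_grad()
def cbs_generate(prompt, max_chunks=40):
    beams = [""]                                 # hypotheses stored as continuation text
    for _ in range(max_chunks):
        candidates = []
        for cont in beams:
            inputs = large_tok(prompt + cont, return_tensors="pt").to(large.device)
            for _ in range(K):                   # K sampled successors per beam
                out = large.generate(**inputs, max_new_tokens=L,
                                     pad_token_id=large_tok.eos_token_id, **SAMPLING)
                chunk = large_tok.decode(out[0, inputs.input_ids.shape[1]:],
                                         skip_special_tokens=True)
                candidates.append(cont + chunk)
        # Keep the W hypotheses the small tuned/untuned pair scores highest.
        beams = sorted(candidates, key=lambda c: small_score(prompt, c), reverse=True)[:W]
        # (A full implementation would also terminate beams that emit EOS.)
    return beams[0]
```

The large model only ever samples; the small pair only ever scores, which is why the same recipe applies to black-box large models whose logits are not exposed.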
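
For comparison, the BoN baseline with N = 16 (= W·K) can be sketched by reusing the hypothetical `large`, `large_tok`, `SAMPLING`, and `small_score` names from the sketch above. That BoN selects with the same small-model proxy, and the `max_new_tokens` budget, are assumptions not stated in the quoted text.

```python
@torch.no_grad()
def bon_generate(prompt, n=16, max_new_tokens=200):
    """Best-of-N: sample N full responses from the large model and return the one
    the small tuned/untuned pair scores highest (N = 16 matches W*K above)."""
    inputs = large_tok(prompt, return_tensors="pt").to(large.device)
    candidates = []
    for _ in range(n):
        out = large.generate(**inputs, max_new_tokens=max_new_tokens,
                             pad_token_id=large_tok.eos_token_id, **SAMPLING)
        candidates.append(large_tok.decode(out[0, inputs.input_ids.shape[1]:],
                                           skip_special_tokens=True))
    return max(candidates, key=lambda c: small_score(prompt, c))
```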
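
For the controlled-sentiment task, the Open Datasets row states that the public distilbert-imdb classifier defines the gold reward r_gold. A minimal scoring sketch follows; the Hub id `lvwerra/distilbert-imdb` and the use of the positive-class logit as the scalar reward are assumptions, since the quoted text only says the classifier defines r_gold.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hub id for the publicly available distilbert-imdb sentiment classifier.
REWARD_ID = "lvwerra/distilbert-imdb"
reward_tok = AutoTokenizer.from_pretrained(REWARD_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_ID)


@torch.no_grad()
def gold_reward(texts):
    """Scalar gold reward per text: the classifier's positive-sentiment logit
    (assumes label index 1 is 'positive', per the model's config)."""
    batch = reward_tok(texts, padding=True, truncation=True, return_tensors="pt")
    return reward_model(**batch).logits[:, 1].tolist()


print(gold_reward(["An absolute delight from start to finish.",
                   "A tedious, incoherent mess."]))
```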