Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
Authors: Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned gpt2s to improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., zephyr-7b-beta and its untuned version) can improve the length-controlled win rates of both white-box and black-box large models against gpt-4-turbo (e.g., 34.4% → 37.9% for Llama-3-70B-Instruct and 16.0% → 20.1% for gpt-3.5-turbo-instruct), despite the small models' low win rates (≈10.0%). |
| Researcher Affiliation | Academia | Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao (Shanghai Artificial Intelligence Laboratory). Core Contribution, Corresponding Author. asap.zzhou@gmail.com, yangchao@pjlab.org.cn |
| Pseudocode | Yes | Algorithm 1 Chunk-level Beam Search (CBS) |
| Open Source Code | Yes | Code: https://github.com/ZHZisZZ/weak-to-strong-search |
| Open Datasets | Yes (see the gold-reward sketch below the table) | For controlled-sentiment generation, we reuse the publicly available distilbert-imdb to define the gold reward model r_gold. distilbert-imdb is a classifier fine-tuned on the imdb dataset [53] to classify movie review sentiments. For summarization, we fit a reward model on the summarize_from_feedback dataset [2] as the gold reward model r_gold. |
| Dataset Splits | No | The paper mentions 'validation accuracies' and 'validation prompts' but does not specify the explicit split percentages or absolute counts for training, validation, and test datasets, or refer to a standard split configuration with a citation. |
| Hardware Specification | Yes | Models are evaluated over 1000 test prompts, on one single NVIDIA A100 GPU. Model inference takes place on one single NVIDIA A100 GPU for 7B & 8B and black-box models, and on four NVIDIA A100 GPUs for 70B models. |
| Software Dependencies | No | The paper mentions models like 'gpt2', 'Llama-2', 'zephyr-7b-beta', and the use of the 'Llama-2 tokenizer'. It also refers to the 'standard DPO pipeline'. However, it does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes (a hedged CBS sketch using these settings follows the table) | We use fixed hyperparameters across all tested models. We use temperature T = 0.7, top-k = 50 and top-p = 1.0 when sampling from the language models. For weak-to-strong search (CBS), we use W, K, L = 4, 4, 5 (W: beam width, K: successors per state, L: chunk length). For BoN, we use N = 16 for fair computational comparison with weak-to-strong search (i.e., WK = N). For EFT, we report the best results among β ∈ {1/4, 1/2, 1, 2, 4}. |
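
The Pseudocode and Experiment Setup rows above quote the paper's chunk-level beam search (Algorithm 1) and its settings (W, K, L = 4, 4, 5; temperature 0.7, top-k 50, top-p 1.0). Below is a minimal sketch of how CBS steered by the log-probability ratio between a tuned and an untuned small model could look. The specific model ids (Meta-Llama-3-8B-Instruct as the large model, zephyr-7b-beta and mistral-7b-sft-beta as the small pair), the chunk budget `MAX_CHUNKS`, and the text-level log-probability helper are illustrative assumptions, not the authors' released code; see the linked repository for the official implementation.

```python
# Sketch of chunk-level beam search (CBS) guided by a tuned/untuned small-model pair.
# Model ids, MAX_CHUNKS, and the log-prob helper are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

W, K, L = 4, 4, 5      # beam width, successors per beam, tokens per chunk (paper's setting)
MAX_CHUNKS = 20        # assumed generation budget; the paper presumably stops at EOS instead
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed models: a white-box large model guided by a small tuned/untuned pair.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16).to(device)
small_tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
tuned = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16).to(device)
untuned = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/mistral-7b-sft-beta", torch_dtype=torch.bfloat16).to(device)

@torch.no_grad()
def response_logprob(model, tokenizer, prompt, response):
    """Sum of log-probs of `response` tokens given `prompt` (tokenizer boundary effects ignored)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits[:, :-1]                    # position i predicts token i + 1
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum().item()   # keep response tokens only

@torch.no_grad()
def chunk_level_beam_search(prompt):
    beams = [""] * W                                       # W partial responses
    for _ in range(MAX_CHUNKS):
        candidates = []
        for beam in beams:                                 # sample K chunk continuations per beam
            ids = tok(prompt + beam, return_tensors="pt").input_ids.to(device)
            for _ in range(K):
                out = base.generate(ids, max_new_tokens=L, do_sample=True,
                                    temperature=0.7, top_k=50, top_p=1.0,
                                    pad_token_id=tok.eos_token_id)
                chunk = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
                candidates.append(beam + chunk)
        # Weak reward: tuned-minus-untuned log-prob ratio; the paper's beta scaling is
        # omitted because it does not change which W candidates are kept.
        scores = [response_logprob(tuned, small_tok, prompt, c)
                  - response_logprob(untuned, small_tok, prompt, c)
                  for c in candidates]
        ranked = sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)
        beams = [c for _, c in ranked[:W]]
    return beams[0]

print(chunk_level_beam_search("Write a short movie review with a positive tone.\n"))
```

For black-box large models (e.g., gpt-3.5-turbo-instruct), only the sampling step would change (chunks would come from an API rather than a local model), while the scoring would still use the small tuned/untuned pair.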
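The Open Datasets row states that the gold reward for controlled-sentiment generation is defined from distilbert-imdb. The following is a minimal sketch of one plausible reading, where r_gold is the log-probability of the positive class under that classifier; the hub id lvwerra/distilbert-imdb, the label order, and the exact reward definition are assumptions rather than the paper's verified setup.

```python
# Sketch: defining a gold reward r_gold(x, y) from a sentiment classifier.
# Assumptions: hub id "lvwerra/distilbert-imdb" and label index 1 = positive.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

clf_tok = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")
clf = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

@torch.no_grad()
def gold_reward(text: str) -> float:
    """Log-probability of the positive class for a prompt + completion string."""
    inputs = clf_tok(text, return_tensors="pt", truncation=True)
    logits = clf(**inputs).logits                 # shape [1, 2]: (negative, positive) assumed
    return torch.log_softmax(logits, dim=-1)[0, 1].item()

print(gold_reward("This movie was an absolute delight from start to finish."))
```

For summarization, the analogous r_gold would instead come from the reward model fit on the summarize_from_feedback dataset, which this sketch does not cover.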