Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Is In-Context Learning Sufficient for Instruction Following in LLMs?
Authors: Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on the established benchmark MT-Bench, especially with more capable base LLMs. We then uncover the most relevant elements for successful in-context alignment, finding the crucial role of the decoding parameters. Based on these insights, we show that the approach of URIAL can indeed be improved by adding high-quality, possibly carefully selected via greedy search, demonstrations in context, getting closer to the performance of instruct models. Finally, we provide the first, to our knowledge, systematic comparison of ICL and instruction fine-tuning (IFT) for instruction following in the low data regime, where ICL can be a viable alternative to IFT. |
| Researcher Affiliation | Academia | Hao Zhao EPFL Maksym Andriushchenko EPFL Francesco Croce EPFL Nicolas Flammarion EPFL |
| Pseudocode | No | The paper mentions a "greedy algorithm" and describes its steps in text, notably in Section 3.2 "GREEDY SEARCH FOR EFFECTIVE DEMONSTRATIONS" and Appendix A.3 "GREEDY SEARCH". However, it does not include a clearly labeled pseudocode or algorithm block, presenting the procedure narratively rather than in a structured, code-like format. |
| Open Source Code | Yes | We provide our code at https://github.com/tml-epfl/icl-alignment. |
| Open Datasets | Yes | Table 1: Systematic comparison of URIAL to aligned models on MT-Bench across different base LLMs. Data. We adopt the Skill Mix (Kaur et al., 2024) dataset consisting of 4,000 examples as the source of high-quality examples for our many-shot in-context alignment experiments. Alpaca Eval 2.0 (Li et al., 2023a) provides 805 test instructions, on which we generate new responses using the target model, and then calculate the score by competing with the baseline model (i.e., GPT-4-Turbo) judged by a designated automatic evaluator. Evol-Instruct-70k (Xu et al., 2024) contains 70k training examples with varying complexity and is well known for its use in building the series of WizardLM models. |
| Dataset Splits | Yes | MT-Bench (Zheng et al., 2023) consists of 80 high-quality and challenging questions with two-round interaction, designed to examine the multi-turn conversation and instruction-following capability of models. Alpaca Eval 2.0 (Li et al., 2023a) provides 805 test instructions, on which we generate new responses using the target model, and then calculate the score by competing with the baseline model. ICL Random. We construct the set of in-context demonstrations based on the high-quality data we select from the Skill Mix-4k or Evol-Instruct-70k dataset. By randomly sampling from the high-quality dataset multiple times, we generate a series of in-context demonstration sets that each contain N examples, where N ∈ {0, 7, 17, 27, 37, 47} for the Mistral-7B-v0.2 model (32k context length) and N ∈ {0, 7, 17, 27, 37, 47, 97, 147, 197, 247} for the Llama-3.1-8B model (128k context length), and insert the N demonstrations into the prompt template for ICL as shown in Fig 8. ... The average performance and standard deviations are computed over 5 random seeds. IFT. ...The training examples are randomly sampled from existing IFT datasets, Skill Mix-4k and Evol-Instruct-70k. ...We use 5 random seeds for the Mistral-7B-v0.2 model and 3 random seeds for the Llama-3.1-8B model due to a restricted compute budget. |
| Hardware Specification | No | The paper mentions the number of GPUs used for training in Table 5 (e.g., 2 GPUs for 3 data samples, 4 GPUs for 10-4000 data samples), but it does not specify the exact GPU models (e.g., NVIDIA A100, Tesla V100) or any other specific hardware components like CPU models or memory. |
| Software Dependencies | No | The paper describes the models, datasets, and evaluation metrics used, but it does not provide specific software names with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | Concretely, we employ greedy decoding, i.e., temperature = 0.0, for all models, including base and instruction fine-tuned models, to maximize reproducibility and secure a fair and robust evaluation. Top-p = 1.0 is adopted to keep the full cumulative probability distribution. Besides, we use repetition penalty = 1.152 on base models to prevent degeneration. Table 5: Details of training hyperparameters for IFT experiments. This includes Data Size, # GPUs, Epochs (e.g., 6, 15), LR (e.g., 2e-6, 4e-6), LR Scheduler (Cosine), Batch Size (e.g., 2, 8, 128), Context Win. Len. (2048), WD (0.01), Warmup Rate (0.03). |
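The greedy demonstration search the paper describes narratively (Section 3.2, Appendix A.3) can be sketched as follows. This is a minimal, hedged reconstruction, not the authors' code: `score_fn` stands in for the paper's benchmark-based evaluation, and `toy_score` below is a purely hypothetical scorer used to make the sketch self-contained.

```python
# Sketch of a greedy search over candidate in-context demonstrations:
# starting from an empty context, repeatedly add whichever candidate
# most improves the score of the selected set.

def greedy_select(candidates, score_fn, k):
    """Greedily pick k demonstrations that maximize score_fn(selected)."""
    selected = []
    remaining = list(candidates)
    for _ in range(k):
        best, best_score = None, float("-inf")
        for cand in remaining:
            s = score_fn(selected + [cand])
            if s > best_score:
                best, best_score = cand, s
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy scorer: rewards distinct, longer demonstrations.
def toy_score(demos):
    return len(set(demos)) + 0.01 * sum(len(d) for d in demos)

demos = ["short", "a medium demo", "a much longer demonstration", "short"]
print(greedy_select(demos, toy_score, 2))
# → ['a much longer demonstration', 'a medium demo']
```

Each step costs one evaluation per remaining candidate, so the search is O(k·|candidates|) scoring calls; in the paper's setting each call is a full benchmark evaluation, which is why the selected set stays small.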
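The decoding setup reported under "Experiment Setup" can be written down as a generation config. Only the values (temperature = 0.0, top-p = 1.0, repetition penalty = 1.152 on base models) come from the paper; the key names below follow the common Hugging Face `generate` convention, which is an assumption on our part.

```python
# Decoding parameters as reported in the paper; key names are assumed
# (Hugging Face-style), the values are from the "Experiment Setup" row.
GEN_CONFIG_BASE = {
    "do_sample": False,           # greedy decoding, i.e. temperature = 0.0
    "top_p": 1.0,                 # keep the full cumulative distribution
    "repetition_penalty": 1.152,  # base models only, to prevent degeneration
}

# Instruction fine-tuned models use the same greedy setup without the penalty.
GEN_CONFIG_INSTRUCT = {
    "do_sample": False,
    "top_p": 1.0,
}

print(GEN_CONFIG_BASE["repetition_penalty"])  # → 1.152
```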