Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Thinking LLMs: General Instruction Following with Thought Generation

Authors: Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason E Weston, Sainbayar Sukhbaatar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks. We train on diverse user instructions and evaluate our models on AlpacaEval and Arena-Hard, benchmarks that test general instruction following.
Researcher Affiliation | Collaboration | Meta FAIR; University of California, Berkeley; New York University. Correspondence to: Tianhao Wu <EMAIL>, Sainbayar Sukhbaatar <EMAIL>.
Pseudocode | No | The paper describes the methodology in natural language and diagrams (Figure 1) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository.
Open Datasets | Yes | For initial experiments, we use the synthetic instructions from Yuan et al. (2024b) for training. These instructions are generated from Llama-2-70B-Chat using 8-shot prompting consisting of random samples from the Open Assistant dataset (Köpf et al., 2024). For later experiments, we switched to UltraFeedback (Cui et al., 2023), which contains actual human instructions. ... For evaluation, we use two public benchmarks that test general instruction following capability: AlpacaEval 2 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024). ... we evaluate our model on the GSM8K dataset (Cobbe et al., 2021) that contains grade-school math word problems.
Dataset Splits | Yes | Each training iteration uses 5000 instructions that were not part of the previous iterations. ... We train for 10 epochs in each iteration and select the best checkpoint using a validation set of 1500 prompts randomly sampled from UltraFeedback. ... To obtain a more fine-grained evaluation, we build our own evaluation using UltraFeedback. We take instructions not used in training, and assign them individually to one of 20 categories until each category has 200 samples.
Hardware Specification | No | The paper mentions models such as Llama-3-8B-Instruct and larger models for comparison, and discusses compute requirements in general terms, but it does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions Llama-3-8B-Instruct, Llama-2-70B-Chat, GPT-4, STE, and ArmoRM, which are models or tools, but it does not provide version numbers for any software libraries or dependencies used in the methodology (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We generate K = 8 responses per prompt using temperature 0.8 and top-p of 0.95. We train for 10 epochs in each iteration and select the best checkpoint using a validation set of 1500 prompts randomly sampled from UltraFeedback. We perform up to 4 iterations. We usually set the length-control parameter ρ ∈ [0, 0.5], with 0 equivalent to no length-control.
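The Dataset Splits row quotes an iterative scheme: each iteration trains on 5000 instructions unseen in earlier iterations, with 1500 prompts held out for validation. A minimal sketch of how such disjoint splits could be carved from an instruction pool is below; the function name, seeding, and partitioning order are illustrative assumptions, not taken from the paper.

```python
import random

def make_iteration_splits(instructions, n_iters=4, per_iter=5000,
                          n_val=1500, seed=0):
    """Shuffle the instruction pool, hold out a fixed validation set,
    and carve disjoint per-iteration training chunks from the rest."""
    rng = random.Random(seed)
    pool = list(instructions)
    rng.shuffle(pool)
    val = pool[:n_val]            # held-out validation prompts
    rest = pool[n_val:]
    needed = n_iters * per_iter
    if len(rest) < needed:
        raise ValueError(f"need {needed} training instructions, have {len(rest)}")
    # Each iteration's chunk contains only instructions unseen so far.
    iters = [rest[i * per_iter:(i + 1) * per_iter] for i in range(n_iters)]
    return val, iters
```

With a pool of 30000 instructions this yields one 1500-prompt validation set and four mutually disjoint 5000-instruction training chunks.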
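The Experiment Setup row quotes sampling K = 8 responses per prompt and a length-control parameter ρ ∈ [0, 0.5], with 0 meaning no length control. The paper's exact length-control formula is not reproduced in this report, so the sketch below is an assumption: scores from an external judge (not shown) are min-max normalized, and a penalty of ρ times the normalized length is subtracted before picking the best response.

```python
def select_best_response(responses, scores, rho=0.25):
    """Pick the best of K sampled responses while penalizing length.

    `scores` would come from an external judge or reward model; the
    min-max normalization and the `rho * normalized_length` penalty
    are illustrative assumptions. rho=0 disables length control,
    matching the paper's stated range rho in [0, 0.5].
    """
    lengths = [len(r.split()) for r in responses]

    def normalize(x, lo, hi):
        # Map x into [0, 1]; degenerate range maps to 0.
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    lo_s, hi_s = min(scores), max(scores)
    lo_l, hi_l = min(lengths), max(lengths)
    adjusted = [
        normalize(s, lo_s, hi_s) - rho * normalize(l, lo_l, hi_l)
        for s, l in zip(scores, lengths)
    ]
    best = max(range(len(responses)), key=adjusted.__getitem__)
    return responses[best]
```

With rho=0 this reduces to plain best-of-K selection; larger rho increasingly favors shorter responses with comparable scores.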