Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Stay on Topic with Classifier-Free Guidance
Authors: Guillaume Sanchez, Alexander Spangher, Honglu Fan, Elad Levi, Stella Biderman
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate across a wide array of benchmarks that CFG can be used broadly as an inference-time technique in pure language modeling. We show that CFG (1) improves the performance of Pythia, GPT-2 and LLaMA-family models across a broad set of Q&A, reasoning and code generation tasks, achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B; |
| Researcher Affiliation | Collaboration | (1) LightOn, France (work done while working at Hexaglobe); (2) EleutherAI; (3) Information Sciences Institute, University of Southern California; (4) University of Geneva; (5) Sightful. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our work has been directly incorporated into leading open-source libraries: Huggingface and llama.cpp. |
| Open Datasets | Yes | We run these models on a sample dataset of 32,902 data-points from P3 (Sanh et al., 2021). |
| Dataset Splits | No | The paper does not explicitly state specific training, validation, or test dataset splits (percentages or counts) for the datasets used. |
| Hardware Specification | No | The paper mentions "He took care of running the experiments of Section 3.1 thanks to his access to CoreWeave and Stability's computing cluster." but does not provide specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using "EleutherAI's Language Model Evaluation Harness (Gao et al., 2021)" but does not specify its version number or any other software dependencies with version information. |
| Experiment Setup | Yes | We test different CFG strengths [3] and different temperatures, evaluating at pass@k for k = 1, 10, 100 [4]. We show the results for temperature = 0.2 in Table 2 [5]. Footnote 3: γ = 1.0, 1.1, 1.25, 1.5, 1.75, 2.0 |
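
To make the role of the CFG strength γ in the Experiment Setup row concrete, the following is a minimal sketch of how classifier-free guidance is commonly applied to next-token log-probabilities: the guided score is the unconditional log-probability plus γ times the gap between conditional and unconditional log-probabilities. The function names (`log_softmax`, `cfg_log_probs`) and the toy three-token logits are illustrative assumptions, not code from the paper or from any library; only the γ values are taken from the quoted footnote.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Hypothetical helper: combine conditional and unconditional
    next-token scores with guidance strength gamma, then renormalize.
    gamma = 1.0 recovers the ordinary conditional distribution."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)

# Guidance strengths listed in the quoted footnote.
GAMMAS = [1.0, 1.1, 1.25, 1.5, 1.75, 2.0]

# Toy logits over a three-token vocabulary (made up for illustration).
cond_logits = [2.0, 1.0, 0.0]    # model conditioned on the prompt
uncond_logits = [1.0, 1.0, 1.0]  # model without the prompt

for g in GAMMAS:
    probs = [math.exp(lp) for lp in cfg_log_probs(cond_logits, uncond_logits, g)]
    print(g, [round(p, 3) for p in probs])
```

In this toy setting, raising γ above 1 shifts probability mass toward the token the prompt favors, which is the sense in which a larger strength gives stronger guidance.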