Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Representation Consistency for Accurate and Coherent LLM Answer Aggregation
Authors: Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. In this section, we quantitatively evaluate the effectiveness of RC for improving LLMs' task performance at test time. We first introduce our experiment setup. |
| Researcher Affiliation | Collaboration | 1 Imperial College London; 2 J.P. Morgan AI Research; 3 King's College London |
| Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code with detailed instructions. |
| Open Datasets | Yes | We test all methods using four reasoning datasets spanning diverse topics: CommonsenseQA (CSQA) [62] for common sense reasoning, MMLU [26] for exam-style questions, MedMCQA [55] for medical domain-specific knowledge, and HellaSwag (HSwag) [75] for sentence completion tasks. |
| Dataset Splits | Yes | We use the test or eval sets of these datasets: 1200 samples for CSQA, and 3000 samples for the other datasets. For Gemma-2-27B-IT, we experiment with 1000 samples per dataset due to the heavy computation load. We use 50% of the data to find the optimal hyperparameters for each method (λ for RC-E, λ and model layer for RC), and report task performance results on the remaining 50%. |
| Hardware Specification | Yes | All experiments are executed on a Linux machine with 3 NVIDIA A100 GPUs, each with 80GB memory. |
| Software Dependencies | No | The paper mentions using the 'Python library sae-lens' and the 'transformers library' but does not specify their version numbers. |
| Experiment Setup | Yes | We use the following configurations (number of responses = number of prompt phrasings × number of samples from each prompt): 12 responses with (12 × 1), (6 × 2), (4 × 3), (3 × 4), (2 × 6), (1 × 12), and 6 responses with (6 × 1), (3 × 2), (2 × 3), (1 × 6). Model generations from each prompt are sampled with 0.7 temperature for balanced randomness. We use 50% of the data to find the optimal hyperparameters for each method (λ for RC-E, λ and model layer for RC), and report task performance results on the remaining 50%. |
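
The experiment setup quoted in the table can be illustrated with a minimal Python sketch. It enumerates the (prompt phrasings × samples per prompt) configurations for a given response budget and draws a 50/50 split between hyperparameter tuning and reporting. The function names `prompt_sample_configs` and `split_half`, the seed, and the use of a random shuffle are assumptions made for illustration. The source does not say how the split was drawn, and this is not the authors' code.

```python
import random


def prompt_sample_configs(total_responses):
    """List (num_prompts, samples_per_prompt) pairs whose product is total_responses.

    Pairs are ordered from most prompt phrasings to fewest.
    """
    return [
        (num_prompts, total_responses // num_prompts)
        for num_prompts in range(total_responses, 0, -1)
        if total_responses % num_prompts == 0
    ]


def split_half(n_items, seed=0):
    """Split item indices into two disjoint halves.

    Assumption: a seeded random shuffle; the source only states a 50%/50% split.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    half = n_items // 2
    return indices[:half], indices[half:]


# Response budgets named in the table: 12 and 6 responses per question.
configs_12 = prompt_sample_configs(12)
configs_6 = prompt_sample_configs(6)

# CSQA uses 1200 samples: half for tuning hyperparameters, half for reporting.
tune_idx, eval_idx = split_half(1200)

print(configs_12)
print(configs_6)
print(len(tune_idx), len(eval_idx))
```

Every divisor pair of the response budget appears once, which matches the six configurations listed for 12 responses and the four listed for 6 responses.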