Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Benford’s Curse: Tracing Digit Bias to Numerical Hallucination in LLMs

Authors: Jiandong Shao, Yao Lu, Jianfei Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate six open-source LLMs on the proposed Digit Bias Benchmark, including models from LLaMA [30], Qwen [31], Gemma [32], OLMo [33] and Mistral [34] families. By design, the benchmark enforces a uniform digit distribution in its ground truth, so any deviation in the model's output distribution directly reflects inherent generation bias. As shown in Figure 3a, Mistral-7B exhibits a strong and consistent over-generation of smaller digits. For example, digit 1 often appears over 12% of the time, while digits such as 8 and 9 are severely underrepresented. This trend closely parallels the skew found in the pretraining corpus, reinforcing the hypothesis that the bias originates from corpus-level statistics.
Researcher Affiliation	Academia	Jiandong Shao¹, Yao Lu², Jianfei Yang¹. ¹Nanyang Technological University, ²University College London. EMAIL, EMAIL, EMAIL
Pseudocode	No	The paper describes methods using mathematical formulas and prose, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code and data is made available here: https://github.com/shamy28/Benford-Curse.
Open Datasets	Yes	To empirically assess whether the digit distribution in pretraining corpus aligns with Benford's Law, we analyze the Olmo-Mix-1124 corpus [20], a widely used 22.4TB data collection curated for training open-source LLMs. primarily adapted from DeepMind's Mathematics Dataset [29]. Table 7: The list of datasets used in this work. olmo-mix-1124: Link, Open Data Commons License; mathematics_dataset: Link, Apache license 2.0
Dataset Splits	No	To investigate whether skewed digit distribution in pretraining data leads to generation bias, we introduce the Digit Bias Benchmark, a suite of seven numerical reasoning tasks designed to yield uniformly distributed ground-truth digits. Each task contains over 1,000 examples, with answer sets carefully constructed to ensure uniform digit distribution: when pooling all digits from all positions across all answers within a task (e.g., the answer "132" contributes three digits: 1, 3, and 2), each digit 0-9 appears approximately 10% of the time.
Hardware Specification	Yes	All experiments presented in this paper were run on a cluster of four NVIDIA GeForce RTX 3090 GPUs with 24GB of memory and using a single 24GB memory GPU.
Software Dependencies	No	The paper mentions using LLMs based on decoder-only Transformer architecture but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	To ensure the accuracy and reproducibility of all results, we employed greedy decoding for generation. Specifically, we compute the entropy of the model's output digit token distribution at each layer. Samples with entropy exceeding a threshold (e.g., >3.0 at layer 26) are flagged as potentially biased. We prune only the top 0.01% most digit-1-selective neurons, and activate this intervention only during the generation of digit tokens. Table 4: Prompt Templates Used in Identification Task Table 5: Prompt Templates Used in Digit Bias Benchmark
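
Illustrative sketch (not taken from the paper or its repository): the Dataset Splits and Experiment Setup excerpts above describe two computations, pooling every digit of every answer to check for a uniform distribution, and flagging a sample when the entropy of a digit distribution exceeds a threshold. The Python below shows one way these could look. The function names are hypothetical, and measuring entropy in bits over the ten digits is an assumption, made because the quoted 3.0 threshold sits just under log2(10) ≈ 3.32. The paper computes entropy on per-layer model outputs; here a plain list of probabilities stands in for that.

```python
import math
from collections import Counter

DIGITS = "0123456789"


def pooled_digit_distribution(answers):
    """Relative frequency of each digit 0-9, pooled over all positions of all answers.

    For example, the answer "132" contributes three digits: 1, 3, and 2.
    """
    counts = Counter(ch for answer in answers for ch in str(answer) if ch in DIGITS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no digits found in answers")
    return {d: counts[d] / total for d in DIGITS}


def entropy_bits(probs):
    """Shannon entropy in bits (assumed unit); zero-probability entries are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def is_flagged(probs, threshold=3.0):
    """True when entropy exceeds the threshold (3.0 is the example value quoted above)."""
    return entropy_bits(probs) > threshold


if __name__ == "__main__":
    # Hypothetical answer set in which each digit 0-9 appears exactly once.
    dist = pooled_digit_distribution(["132", "4075", "968"])
    print(dist)
    print(is_flagged(list(dist.values())))
```

A benchmark task that meets the stated design goal would give every digit a pooled frequency near 0.10, so a large gap between any two entries of the returned dictionary would indicate a non-uniform answer set.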