Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Estimating the Hallucination Rate of Generative AI
Authors: Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P. Cunningham, David Blei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks. In Section 3, we empirically evaluate our methods. |
| Researcher Affiliation | Academia | Correspondence to EMAIL. Department of Statistics, Columbia University. Department of Computer Science, Columbia University. OATML, Department of Computer Science, University of Oxford. |
| Pseudocode | Yes | Algorithm 1 $\widehat{\mathrm{PHR}}(x, D_n, p_\theta, M, N, K)$ and Algorithm 2 $\widehat{\mathrm{THR}}(x, D_n, D, p_\theta, K)$ |
| Open Source Code | Yes | We provide implementations of the method proposed and experiments used in this paper in https://github.com/blei-lab/phr. |
| Open Datasets | Yes | We consider tasks defined by six datasets: Stanford Sentiment Treebank (SST2) [49], Subjectivity [50], AG News [6], Medical QP [51], RTE [52], and WNLI [53]. |
| Dataset Splits | Yes | For each task and context length $n \in [2, 4, 8, 16, 32]$, we sample 50 random training datasets $D_n$, 50 evaluation datasets $D_{\text{eval}}$, and 10 random test samples. |
| Hardware Specification | Yes | For our experiments, we used an internal cluster made up of A100s and RTX 8000s, each with 40 to 48 GB of memory. |
| Software Dependencies | No | We implement our neural process by modifying the Llama 2 architecture [42]... We run LLaMA-2-7B as an unquantized model (16-bit). |
| Experiment Setup | Yes | We train the model from random initialization on sequences of (x, y) pairs using a standard next token prediction objective and use the AdamW optimizer [84] with learning_rate = 0.0001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and weight_decay = 1e-6. We use a cosine learning rate schedule, with warmup of 2000 steps, and decay final learning rate down to 10% of the peak learning rate. We set max_new_tokens = 200, temperature = 1 and top_p = 0.9 to provide a high level diversity and randomness to the generated output. |
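
The learning rate schedule quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: the peak rate, the 2000 warmup steps, and the 10% final rate come from the row above, while the linear warmup shape, the total step count `TOTAL_STEPS`, and the function name `lr_at` are assumptions made only for illustration.

```python
import math

PEAK_LR = 1e-4            # learning_rate quoted in the table
WARMUP_STEPS = 2000       # warmup length quoted in the table
FINAL_LR_FRACTION = 0.1   # final rate is 10% of the peak, per the table
TOTAL_STEPS = 100_000     # ASSUMPTION: the table does not state the training length


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Return the learning rate at a 0-indexed optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # ASSUMPTION: warmup is linear from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = PEAK_LR * FINAL_LR_FRACTION
    return floor + (PEAK_LR - floor) * cosine
```

Under these assumptions the rate rises to 1e-4 at step 2000 and then decays smoothly toward 1e-5; a different total step count only stretches or compresses the decay phase.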