Estimating the Hallucination Rate of Generative AI

Authors: Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P. Cunningham, David Blei

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks. In Section 3, we empirically evaluate our methods.
Researcher Affiliation | Academia | Correspondence to {adj2147, nb2838}@columbia.edu. Department of Statistics, Columbia University. Department of Computer Science, Columbia University. OATML, Department of Computer Science, University of Oxford.
Pseudocode | Yes | Algorithm 1 PHR-hat(x, D_n, pθ, M, N, K) and Algorithm 2 THR-hat(x, D_n, D, pθ, K). (See the estimator sketch after the table.)
Open Source Code | Yes | We provide implementations of the method proposed and experiments used in this paper in https://github.com/blei-lab/phr.
Open Datasets | Yes | We consider tasks defined by six datasets: Stanford Sentiment Treebank (SST2) [49], Subjectivity [50], AG News [6], Medical QP [51], RTE [52], and WNLI [53].
Dataset Splits | Yes | For each task and context length n ∈ {2, 4, 8, 16, 32}, we sample 50 random training datasets D_n, 50 evaluation datasets D_eval, and 10 random test samples. (Sampling sketch after the table.)
Hardware Specification | Yes | For our experiments, we used an internal cluster made up of A100s and RTX 8000s, each with 40 to 48 GB of memory.
Software Dependencies | No | We implement our neural process by modifying the Llama 2 architecture [42]... We run LLaMA-2-7B as an unquantized model (16-bit). (Loading sketch after the table.)
Experiment Setup | Yes | We train the model from random initialization on sequences of (x, y) pairs using a standard next-token prediction objective and use the AdamW optimizer [84] with learning_rate = 0.0001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and weight_decay = 1e-6. We use a cosine learning rate schedule with a warmup of 2000 steps, and decay the final learning rate down to 10% of the peak learning rate. We set max_new_tokens = 200, temperature = 1, and top_p = 0.9 to provide a high level of diversity and randomness in the generated output. (Training-config sketch after the table.)
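
The Pseudocode row names Algorithm 1, PHR-hat, a Monte Carlo estimator of the posterior hallucination rate. Below is a minimal Python sketch of that estimator's loop structure; `sample_pairs`, `sample_response`, and `log_prob` are hypothetical helpers standing in for sampling from and scoring with the in-context learner pθ, and the thresholded log-likelihood indicator with `log_eps` is a simplification rather than the paper's exact hallucination definition.

```python
import numpy as np

def phr_hat(x, D_n, model, M, N, K, log_eps):
    """Monte Carlo sketch of Algorithm 1, PHR-hat(x, D_n, p_theta, M, N, K).

    `model.sample_pairs`, `model.sample_response`, and `model.log_prob` are
    hypothetical stand-ins for autoregressive sampling and scoring with the
    in-context learner p_theta; they are not the repo's actual API.
    """
    per_draw_rates = []
    for _ in range(M):
        # Imitate one posterior draw of the data-generating mechanism by
        # sampling N fresh (x, y) pairs from the predictive given context D_n.
        D_tilde = model.sample_pairs(context=D_n, n=N)
        flagged = 0
        for _ in range(K):
            # Sample a candidate response to the query x from the predictive.
            y = model.sample_response(x, context=D_n)
            # Score it conditioned on D_n plus the imitation sample; a low
            # log-likelihood flags a hallucination (simplified indicator
            # thresholded at the user-chosen log_eps).
            if model.log_prob(y, x, context=D_n + D_tilde) < log_eps:
                flagged += 1
        per_draw_rates.append(flagged / K)
    # Average over mechanism draws to estimate the posterior hallucination rate.
    return float(np.mean(per_draw_rates))
```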
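
The Dataset Splits row pins down the evaluation protocol: per task and context length, 50 random training contexts, 50 evaluation datasets, and 10 test samples. A sketch of that sampling loop, assuming a pool of labeled examples per task; the size of each D_eval is an assumption here, as the row does not state it.

```python
import random

CONTEXT_LENGTHS = [2, 4, 8, 16, 32]

def draw_pairs(pool, k, rng):
    # Hypothetical helper: sample k labeled (x, y) pairs without replacement.
    return rng.sample(pool, k)

def build_splits(pool, n_datasets=50, n_test=10, seed=0):
    rng = random.Random(seed)
    splits = []
    for n in CONTEXT_LENGTHS:
        for _ in range(n_datasets):
            D_n = draw_pairs(pool, n, rng)         # random training context
            D_eval = draw_pairs(pool, n, rng)      # evaluation set (size assumed)
            tests = draw_pairs(pool, n_test, rng)  # 10 random test samples
            splits.append((n, D_n, D_eval, tests))
    return splits
```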
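
The Software Dependencies and Experiment Setup rows together describe inference: LLaMA-2-7B run unquantized in 16-bit, sampling with max_new_tokens = 200, temperature = 1, and top_p = 0.9. A sketch using Hugging Face transformers, assuming the usual "meta-llama/Llama-2-7b-hf" checkpoint id and a purely illustrative prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumed Hub id for LLaMA-2-7B

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # unquantized 16-bit weights, per the row above
    device_map="auto",
)

# An ICL prompt would concatenate the (x, y) demonstrations and the query;
# this one-liner is only a placeholder.
prompt = "Review: a gorgeous, witty film.\nSentiment:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample=True,       # enables temperature / top_p sampling
    max_new_tokens=200,
    temperature=1.0,
    top_p=0.9,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```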
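
The training configuration in the Experiment Setup row maps directly onto PyTorch. The sketch below uses a stand-in module in place of the Llama-style neural process and assumes a hypothetical total step count, since the row does not state the training length; the lambda implements linear warmup followed by cosine decay down to 10% of the peak learning rate.

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the Llama-2-style neural process

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                # learning_rate = 0.0001
    betas=(0.9, 0.999),     # β1, β2
    eps=1e-8,               # ϵ = 1e-8
    weight_decay=1e-6,      # weight_decay, as quoted in the row above
)

def lr_lambda(step, warmup=2000, total=100_000, final_frac=0.1):
    # Linear warmup for `warmup` steps, then cosine decay from the peak
    # learning rate to `final_frac` (10%) of the peak. `total` is a
    # hypothetical training length; the paper row does not state it.
    if step < warmup:
        return step / max(1, warmup)
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_frac + (1.0 - final_frac) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

A custom lambda is used here because stock cosine schedules such as transformers' get_cosine_schedule_with_warmup decay to zero rather than to a fraction of the peak.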