Estimating the Hallucination Rate of Generative AI
Authors: Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P. Cunningham, David Blei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks. In Section 3, we empirically evaluate our methods. |
| Researcher Affiliation | Academia | Correspondence to {adj2147, nb2838}@columbia.edu. Department of Statistics, Columbia University. Department of Computer Science, Columbia University. OATML, Department of Computer Science, University of Oxford. |
| Pseudocode | Yes | Algorithm 1 PHR(x, Dn, pθ, M, N, K) and Algorithm 2 THR(x, Dn, D, pθ, K) |
| Open Source Code | Yes | We provide implementations of the method proposed and experiments used in this paper in https://github.com/blei-lab/phr. |
| Open Datasets | Yes | We consider tasks defined by six datasets: Stanford Sentiment Treebank (SST2) [49], Subjectivity [50], AG News [6], Medical QP [51], RTE [52], and WNLI [53]. |
| Dataset Splits | Yes | For each task and context length n ∈ [2, 4, 8, 16, 32], we sample 50 random training datasets Dn, 50 evaluation datasets Deval, and 10 random test samples. |
| Hardware Specification | Yes | For our experiments, we used an internal cluster made up of A100s and RTX 8000s, each with 40 to 48 GB of memory. |
| Software Dependencies | No | We implement our neural process by modifying the Llama 2 architecture [42]... We run LLaMA-2-7B as an unquantized model (16-bit). |
| Experiment Setup | Yes | We train the model from random initialization on sequences of (x, y) pairs using a standard next-token prediction objective and use the AdamW optimizer [84] with learning_rate = 0.0001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and weight_decay = 1e-63. We use a cosine learning rate schedule, with a warmup of 2000 steps, and decay the final learning rate down to 10% of the peak learning rate. We set max_new_tokens = 200, temperature = 1, and top_p = 0.9 to provide a high level of diversity and randomness in the generated output. |
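
To make the Pseudocode row concrete, here is a minimal, hypothetical Monte Carlo sketch matching the Algorithm 1 signature PHR(x, Dn, pθ, M, N, K). It assumes the paper's Bayesian reading of in-context learning, where imagined future examples drawn from the model stand in for posterior draws of the response-generating mechanism, and it counts a sampled response as a hallucination when its log-likelihood under the model conditioned on that imagined data falls below a threshold log ε (a simplification of the paper's exact criterion). The helpers `sample_examples`, `sample_response`, and `log_prob` are placeholder wrappers around the language model, not functions from the authors' repository.

```python
# Hypothetical sketch of a Monte Carlo estimator with the Algorithm 1 signature
# PHR(x, Dn, p_theta, M, N, K). The helper callables are placeholders for
# model-specific code; the threshold test is a simplification.
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # one (x, y) in-context example


def estimate_phr(
    x: str,
    context: Sequence[Example],  # D_n: the observed in-context dataset
    sample_examples: Callable[[Sequence[Example], int], List[Example]],
    sample_response: Callable[[str, Sequence[Example]], str],
    log_prob: Callable[[str, str, Sequence[Example]], float],
    M: int = 10,        # number of imagined datasets (posterior draws of the mechanism)
    N: int = 32,        # size of each imagined dataset
    K: int = 10,        # candidate responses scored per imagined dataset
    eps: float = 1e-3,  # likelihood threshold below which a response counts as a hallucination
) -> float:
    """Monte Carlo estimate of the posterior hallucination rate for query x."""
    indicators = []
    for _ in range(M):
        # Imagine N future examples from the model's predictive given D_n; under the
        # Bayesian reading of in-context learning this stands in for sampling a
        # response-generating mechanism from the posterior.
        imagined = list(context) + list(sample_examples(context, N))
        for _ in range(K):
            # Draw a candidate response from the predictive given only D_n ...
            y = sample_response(x, context)
            # ... and score it under the model conditioned on the imagined mechanism.
            indicators.append(float(log_prob(y, x, imagined) < math.log(eps)))
    return sum(indicators) / len(indicators)
```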
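
The Dataset Splits row describes a resampling protocol rather than fixed splits. A hedged sketch of that protocol, assuming each task is a list of labeled (text, label) pairs; the evaluation-set size (taken here to equal the context length n) is an assumption not stated in the quoted text.

```python
# Hypothetical sketch of the per-task resampling protocol quoted in the Dataset
# Splits row: for each context length n, draw 50 random in-context training sets
# D_n, 50 evaluation sets, and 10 test examples.
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]


def sample_splits(
    task_examples: Sequence[Example],
    n: int,
    n_train_sets: int = 50,
    n_eval_sets: int = 50,
    n_test: int = 10,
    seed: int = 0,
):
    rng = random.Random(seed)
    pool = list(task_examples)
    train_sets: List[List[Example]] = [rng.sample(pool, n) for _ in range(n_train_sets)]
    eval_sets: List[List[Example]] = [rng.sample(pool, n) for _ in range(n_eval_sets)]
    test_examples: List[Example] = rng.sample(pool, n_test)
    return train_sets, eval_sets, test_examples


# Context lengths used per task, as quoted above.
context_lengths = [2, 4, 8, 16, 32]
```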
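
The Experiment Setup row quotes the optimizer, learning-rate schedule, and decoding settings. A hedged PyTorch sketch mirroring those numbers follows; the total step count, the placeholder model, and the weight-decay value are assumptions, and the custom lambda is one common way to implement a cosine decay down to 10% of the peak learning rate after 2000 warmup steps.

```python
# Hypothetical PyTorch sketch of the optimizer and learning-rate schedule
# described in the Experiment Setup row. The model here is a stand-in for the
# authors' modified Llama 2 neural process.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

peak_lr = 1e-4            # learning_rate = 0.0001
warmup_steps = 2000       # warmup of 2000 steps
total_steps = 100_000     # assumption: not stated in the quoted text
final_lr_fraction = 0.1   # decay down to 10% of the peak learning rate

model = torch.nn.Linear(16, 16)  # placeholder module

optimizer = AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-6,  # assumption for this sketch; see the quoted setup above
)


def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay from the peak to 10% of the peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr_fraction + (1.0 - final_lr_fraction) * cosine


scheduler = LambdaLR(optimizer, lr_lambda)

# Decoding settings quoted for generation (e.g. with a Hugging Face `generate` call):
# max_new_tokens=200, temperature=1.0, top_p=0.9, do_sample=True
```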