reproducibilityindex.ai

Estimating the Hallucination Rate of Generative AI

Authors: Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P. Cunningham, David Blei

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks. In Section 3, we empirically evaluate our methods.
Researcher Affiliation	Academia	Correspondence to {adj2147, nb2838}@columbia.edu. Department of Statistics, Columbia University. Department of Computer Science, Columbia University. OATML, Department of Computer Science, University of Oxford.
Pseudocode	Yes	Algorithm 1: $\widehat{\mathrm{PHR}}(x, D_n, p_\theta, M, N, K)$ and Algorithm 2: $\widehat{\mathrm{THR}}(x, D_n, D, p_\theta, K)$
Open Source Code	Yes	We provide implementations of the method proposed and experiments used in this paper in https://github.com/blei-lab/phr.
Open Datasets	Yes	We consider tasks defined by six datasets: Stanford Sentiment Treebank (SST2) [49], Subjectivity [50], AG News [6], Medical QP [51], RTE [52], and WNLI [53].
Dataset Splits	Yes	For each task and context length n ∈ [2, 4, 8, 16, 32], we sample 50 random training datasets D_n, 50 evaluation datasets D_eval, and 10 random test samples.
Hardware Specification	Yes	For our experiments, we used an internal cluster made up of A100s and RTX 8000s, each with 40 to 48 GB of memory.
Software Dependencies	No	We implement our neural process by modifying the Llama 2 architecture [42]... We run LLaMA-2-7B as an unquantized model (16-bit).
Experiment Setup	Yes	We train the model from random initialization on sequences of (x, y) pairs using a standard next token prediction objective and use the AdamW optimizer [84] with learning_rate = 0.0001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and weight_decay = 1e-6. We use a cosine learning rate schedule, with warmup of 2000 steps, and decay final learning rate down to 10% of the peak learning rate. We set max_new_tokens = 200, temperature = 1 and top_p = 0.9 to provide a high level of diversity and randomness to the generated output.
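
The learning rate schedule in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the peak rate (0.0001), the 2000 warmup steps, and the 10% floor come from the table, while the helper name `lr_at`, the `total_steps` argument, and the linear shape of the warmup are assumptions made here for illustration.

```python
import math

PEAK_LR = 1e-4           # learning_rate reported in the table
WARMUP_STEPS = 2000      # warmup length reported in the table
FINAL_LR_FRACTION = 0.1  # final rate is 10% of the peak, per the table


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at an optimizer step (hypothetical helper).

    total_steps is an assumption: the table does not report the
    length of training.
    """
    if total_steps <= WARMUP_STEPS:
        raise ValueError("total_steps must exceed the warmup length")
    if step < WARMUP_STEPS:
        # Assumption: warmup rises linearly from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS))
    # Cosine factor falls smoothly from 1 to 0 as progress goes 0 -> 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = FINAL_LR_FRACTION * PEAK_LR
    return min_lr + (PEAK_LR - min_lr) * cosine
```

Under these assumptions the rate starts at 0, reaches 0.0001 at step 2000, and settles at 0.00001 once `step` reaches `total_steps`. In a PyTorch training loop, a function of this shape could be divided by the peak rate and passed to a per-step scheduler alongside the AdamW settings listed above.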