Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SELF-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models. |
| Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI; IBM Research AI. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 SELF-RAG Inference |
| Open Source Code | Yes | Our code and trained models are available at https://selfrag.github.io/. |
| Open Datasets | Yes | In particular, we sample instances from Open-Instruct processed data (Wang et al., 2023) and knowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018). |
| Dataset Splits | Yes | We conduct evaluations of our SELF-RAG and diverse baselines on a range of downstream tasks, holistically evaluating outputs with metrics designed to assess overall correctness, factuality, and fluency. Throughout these experiments, we conduct zero-shot evaluations, where we provide instructions describing tasks without few-shot demonstrations (Wei et al., 2022; Sanh et al., 2022). |
| Hardware Specification | Yes | We use 4 Nvidia A100 with 80GB memory to train our models. ... We run inference of our trained models using 1-2 Quadro RTX 6000 GPUs with 24GB memory. |
| Software Dependencies | No | The paper mentions 'Deepspeed stage 3' and 'Flash Attention (Dao et al., 2022)' and 'vllm (Kwon et al., 2023)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All models are trained for 3 epochs with a batch size of 128, a peak learning rate of 2e-5 with 3% warmup steps, and linear decay afterward. We set the maximum token length to be 2,048 for the 7B model, and 1,524 for the 13B model due to the memory constraint. |
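
The learning-rate schedule quoted in the Experiment Setup row (peak 2e-5, 3% warmup, linear decay afterward) can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `lr_at_step`, the total step count of 1000, and the choice to decay all the way to zero are assumptions made here, since the quoted text specifies none of them.

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then linear decay.

    peak_lr and warmup_ratio come from the quoted setup. Decaying to
    exactly zero at total_steps is an assumption; the quoted text only
    says "linear decay afterward".
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for illustration
    for s in (0, 15, 30, 515, 1000):
        print(s, lr_at_step(s, total))
```

With the hypothetical 1000 total steps, warmup covers the first 30 steps and the rate peaks at step 30 before declining for the remainder of training.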