Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SELF-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact-verification tasks, and it shows significant gains in factuality and citation accuracy for long-form generation relative to these models. |
| Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI; IBM Research AI. {akari,zeqiuwu1,yizhongw,hannaneh}@cs.washington.edu, avi@us.ibm.com |
| Pseudocode | Yes | Algorithm 1 SELF-RAG Inference (see the inference sketch below the table) |
| Open Source Code | Yes | Our code and trained models are available at https://selfrag.github.io/. |
| Open Datasets | Yes | In particular, we sample instances from Open-Instruct processed data (Wang et al., 2023) and knowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018). |
| Dataset Splits | Yes | We conduct evaluations of our SELF-RAG and diverse baselines on a range of downstream tasks, holistically evaluating outputs with metrics designed to assess overall correctness, factuality, and fluency. Throughout these experiments, we conduct zero-shot evaluations, where we provide instructions describing tasks without few-shot demonstrations (Wei et al., 2022; Sanh et al., 2022). |
| Hardware Specification | Yes | We use 4 Nvidia A100 with 80GB memory to train our models. ... We run inference of our trained models using 1-2 Quadro RTX 6000 GPUs with 24GB memory. |
| Software Dependencies | No | The paper mentions 'Deepspeed stage 3' and 'Flash Attention (Dao et al., 2022)' and 'vllm (Kwon et al., 2023)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All models are trained for 3 epochs with a batch size of 128, a peak learning rate of 2e-5 with 3% warmup steps, and linear decay afterward. We set the maximum token length to be 2,048 for the 7B model, and 1,524 for the 13B model due to the memory constraint. (See the configuration sketch below the table.) |
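The Pseudocode row points to Algorithm 1 (SELF-RAG Inference) in the paper. Below is a minimal sketch of that segment-level loop, assuming hypothetical `generate` and `retrieve` callables and a simplified sum of reflection-token scores; it is an illustration of the idea, not the authors' implementation.

```python
# Hedged sketch of the SELF-RAG inference loop (Algorithm 1 in the paper).
# The generate/retrieve interfaces and the reflection-token dictionary keys
# ("Retrieve", "IsRel", "IsSup", "IsUse", "EOS") are hypothetical stand-ins.
from typing import Callable, Dict, List, Tuple

def self_rag_inference(
    prompt: str,
    generate: Callable[[str], Tuple[str, Dict]],  # returns (segment, reflection tokens)
    retrieve: Callable[[str], List[str]],         # returns top passages for the current context
    max_segments: int = 8,
) -> str:
    """Generate segment by segment, retrieving on demand and ranking
    candidate continuations by critique (reflection-token) scores."""
    output: List[str] = []
    for _ in range(max_segments):
        context = prompt + "".join(output)
        # 1) The model first predicts a Retrieve token for the next segment.
        segment, tokens = generate(context)
        if tokens.get("Retrieve") == "yes":
            # 2) Retrieve passages and generate one candidate segment per passage.
            candidates = []
            for passage in retrieve(context):
                cand_segment, cand_tokens = generate(context + "\n" + passage)
                # 3) Score each candidate with its IsRel / IsSup / IsUse critique tokens.
                score = (
                    cand_tokens.get("IsRel", 0.0)
                    + cand_tokens.get("IsSup", 0.0)
                    + cand_tokens.get("IsUse", 0.0)
                )
                candidates.append((score, cand_segment, cand_tokens))
            if candidates:
                # 4) Keep the highest-scoring candidate as the next segment.
                _, segment, tokens = max(candidates, key=lambda c: c[0])
        output.append(segment)
        if tokens.get("EOS"):
            break
    return "".join(output)
```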
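The Experiment Setup row quotes concrete hyperparameters. The sketch below maps them onto Hugging Face `TrainingArguments` for orientation only; the per-device batch size, gradient-accumulation split, mixed precision, and DeepSpeed config path are assumptions, not values stated in the paper.

```python
# Minimal sketch of the quoted fine-tuning hyperparameters as TrainingArguments.
# Assumption: effective batch size 128 = 8 per device x 4 accumulation x 4 A100 GPUs.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="selfrag-7b",             # placeholder output path
    num_train_epochs=3,                  # "trained for 3 epochs"
    per_device_train_batch_size=8,       # assumption (see note above)
    gradient_accumulation_steps=4,       # assumption (see note above)
    learning_rate=2e-5,                  # peak learning rate of 2e-5
    warmup_ratio=0.03,                   # 3% warmup steps
    lr_scheduler_type="linear",          # linear decay afterward
    bf16=True,                           # assumption: mixed precision on A100
    deepspeed="ds_stage3.json",          # placeholder DeepSpeed stage-3 config
)
# The 2,048 (7B) / 1,524 (13B) maximum token length is applied at tokenization
# time, e.g. tokenizer(..., max_length=2048, truncation=True), not here.
```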