reproducibilityindex.ai

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that SELF-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact verification tasks, and it shows significant gains in factuality and citation accuracy for long-form generations relative to these models.
Researcher Affiliation	Collaboration	University of Washington; Allen Institute for AI; IBM Research AI. {akari,zeqiuwu1,yizhongw,hannaneh}@cs.washington.edu, avi@us.ibm.com
Pseudocode	Yes	Algorithm 1 SELF-RAG Inference
Open Source Code	Yes	Our code and trained models are available at https://selfrag.github.io/.
Open Datasets	Yes	In particular, we sample instances from Open-Instruct processed data (Wang et al., 2023) and knowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018).
Dataset Splits	Yes	We conduct evaluations of our SELF-RAG and diverse baselines on a range of downstream tasks, holistically evaluating outputs with metrics designed to assess overall correctness, factuality, and fluency. Throughout these experiments, we conduct zero-shot evaluations, where we provide instructions describing tasks without few-shot demonstrations (Wei et al., 2022; Sanh et al., 2022).
Hardware Specification	Yes	We use 4 Nvidia A100 with 80GB memory to train our models. ... We run inference of our trained models using 1-2 Quadro RTX 6000 GPUs with 24GB memory.
Software Dependencies	No	The paper mentions 'Deepspeed stage 3' and 'Flash Attention (Dao et al., 2022)' and 'vllm (Kwon et al., 2023)' but does not provide specific version numbers for these software components.
Experiment Setup	Yes	All models are trained for 3 epochs with a batch size of 128, a peak learning rate of 2e-5 with 3% warmup steps, and linear decay afterward. We set the maximum token length to be 2,048 for the 7B model, and 1,524 for the 13B model due to the memory constraint.
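
The learning-rate schedule quoted in the Experiment Setup row (peak 2e-5, 3% warmup, linear decay afterward) can be sketched as a small function. This is an illustrative sketch only, not the authors' training code: the function name `lr_at_step`, the total step count in the usage lines, the choice to start warmup from zero, and the choice to decay all the way to zero are assumptions that the quoted text does not specify.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.03):
    """Learning rate at an optimizer step: linear warmup, then linear decay.

    peak_lr and warmup_frac follow the quoted setup. Starting the warmup
    at zero and decaying to zero are assumptions made for this sketch.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay: ramp linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length of 1000 steps (not taken from the paper).
for s in (0, 15, 30, 515, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```

With a hypothetical 1000 steps, warmup covers the first 30 steps, the rate peaks at step 30, and it falls linearly to zero by the final step.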