SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs

Authors: Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, Jinwoo Shin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches.
Researcher Affiliation | Collaboration | Jaehyung Kim (Carnegie Mellon University), Jaehyun Nam (KAIST AI), Sangwoo Mo (University of Michigan), Jongjin Park (KAIST AI), Sang-Woo Lee (KAIST AI, Naver Cloud), Minjoon Seo (KAIST AI), Jung-Woo Ha (Naver AI Lab, Naver Cloud), Jinwoo Shin (KAIST AI)
Pseudocode | Yes | Algorithm 1 (SuRe). Input: large language model M, question q, N retrieved passages C_N^+, candidate number K. 1. Answer candidate generation: ŷ = M(p_can(q, C_N^+)), with ŷ = [ŷ_1, ..., ŷ_K]. 2. Conditional summarization: s_k = M(p_sum(q, C_N^+, ŷ_k)) for k = 1, ..., K. 3. Instance-wise validation: v(s_k) via Eq. 4 with M(p_val(q, s_k)). 4. Pair-wise ranking: r(s_k, S_K) and r_pair(s_k, s_i) via Eq. 5 with M(p_rank(q, s_k, s_i)). Output: prediction â = ŷ_{k*}, where k* = argmax_k v(s_k) + r(s_k, S_K).
Open Source Code | Yes | The code is available at https://github.com/bbuing9/ICLR24_SuRe
Open Datasets | Yes | For all experiments, we measure zero-shot QA accuracy with the four different ODQA datasets: (1) Natural Questions (NQ) (Kwiatkowski et al., 2019), (2) WebQuestions (WebQ) (Berant et al., 2013), (3) 2WikiMultiHopQA (2Wiki) (Ho et al., 2020), and (4) HotpotQA (Yang et al., 2018).
Dataset Splits | Yes | For NQ and WebQ, we use their original test splits and the 21M English Wikipedia dump (Karpukhin et al., 2020) as the source passages for the retrieval. For 2Wiki and HotpotQA, we use the subsampled splits released by Trivedi et al. (2023), along with the corresponding corpus for each dataset. For the experiments with LLaMA2-chat (Table 2) and more analyses (Section 4.3), we took 500 randomly subsampled examples of the NQ and WebQ datasets for efficient experiments considering limited computing resources, and denoted these datasets NQ and WebQ, respectively.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU/CPU models, processor types, or memory.
Software Dependencies | Yes | For the experiments, we use three recent state-of-the-art LLMs: ChatGPT (gpt-3.5-turbo-0301) (OpenAI, 2022), GPT-4 (gpt-4-0613) (OpenAI, 2023), and LLaMA2-chat-70B (Touvron et al., 2023b). We use a temperature of 0.0 when calling the API or greedy decoding for LLaMA, to remove the effect of random sampling (Sun et al., 2023). For the retrieval methods, we use three different approaches: BM25 (Robertson et al., 2009), DPR-multi (DPR) (Karpukhin et al., 2020), and Contriever (Izacard et al., 2022). We use the implementations in Elasticsearch for BM25, and BEIR for DPR and Contriever, respectively.
Experiment Setup | Yes | We use a temperature of 0.0 when calling the API or greedy decoding for LLaMA, to remove the effect of random sampling (Sun et al., 2023). In the case of SuRe, we use the same prompts across the different datasets, and they are presented in Appendix A. Also, we use a fixed value of K = 2 during the experiments since we observe that the improvements by increasing K are limited, as shown in Appendix B. When multiple candidates have equal plausibility (Eq. 6), SuRe selects the one generated earlier in Eq. 2.
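
The Pseudocode row above gives Algorithm 1 only in outline. The following is a minimal Python sketch of that control flow, assuming a generic `call_llm(prompt)` helper and illustrative prompt strings; the actual prompts are in Appendix A of the paper and the official repository, so this is a sketch of the pipeline rather than the authors' implementation.

```python
# Minimal sketch of the SuRe pipeline (Algorithm 1). `call_llm` is an assumed
# helper that sends a prompt to an LLM and returns its text response; the
# prompt templates here are placeholders, not the paper's Appendix A prompts.
from typing import Callable, List

def sure_pipeline(call_llm: Callable[[str], str],
                  question: str,
                  passages: List[str],
                  k: int = 2) -> str:
    context = "\n".join(passages)

    # 1) Answer candidate generation: ask the LLM for K candidate answers.
    cand_prompt = (f"Context:\n{context}\n\nQuestion: {question}\n"
                   f"List {k} plausible short answers, one per line.")
    candidates = [c.strip() for c in call_llm(cand_prompt).splitlines() if c.strip()][:k]

    # 2) Conditional summarization: one evidence summary per candidate.
    summaries = []
    for cand in candidates:
        sum_prompt = (f"Context:\n{context}\n\nQuestion: {question}\n"
                      f"Summarize the evidence supporting the answer '{cand}'.")
        summaries.append(call_llm(sum_prompt))

    # 3) Instance-wise validation: does the summary actually support an answer?
    def validity(summary: str) -> float:
        val_prompt = (f"Question: {question}\nSummary: {summary}\n"
                      f"Does the summary answer the question? Reply True or False.")
        return 1.0 if "true" in call_llm(val_prompt).lower() else 0.0

    # 4) Pair-wise ranking: count wins of summary i against every other summary.
    def ranking(i: int) -> float:
        wins = 0.0
        for j in range(len(summaries)):
            if i == j:
                continue
            rank_prompt = (f"Question: {question}\n"
                           f"Summary A: {summaries[i]}\nSummary B: {summaries[j]}\n"
                           f"Which summary is more informative? Reply A or B.")
            reply = call_llm(rank_prompt).strip().upper()
            if reply.startswith("A"):
                wins += 1.0
            elif not reply.startswith("B"):
                wins += 0.5  # treat an ambiguous reply as a tie
        return wins

    # Final prediction: candidate whose summary maximizes validity + ranking.
    scores = [validity(s) + ranking(i) for i, s in enumerate(summaries)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```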
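
The Dataset Splits row mentions 500-example subsets of NQ and WebQ used for the LLaMA2-chat experiments. Below is a sketch of how such subsets could be drawn with a fixed seed, assuming the Hugging Face `datasets` hub ids `nq_open` and `web_questions`; the authors' exact data files, split choices, and sampling procedure may differ.

```python
# Sketch of building 500-example NQ / WebQ subsets. The dataset ids and the
# seed are assumptions for illustration, not taken from the paper or its repo.
import random
from datasets import load_dataset  # pip install datasets

def subsample(dataset, n: int = 500, seed: int = 0):
    """Randomly subsample n examples with a fixed seed for reproducibility."""
    indices = random.Random(seed).sample(range(len(dataset)), n)
    return dataset.select(indices)

nq_test = load_dataset("nq_open", split="validation")    # NQ in open-domain QA form
webq_test = load_dataset("web_questions", split="test")  # WebQuestions test split

nq_small = subsample(nq_test)      # 500-example NQ subset
webq_small = subsample(webq_test)  # 500-example WebQ subset
```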
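
The Software Dependencies and Experiment Setup rows state that API calls use a temperature of 0.0 to remove sampling randomness. A minimal sketch of such a call with the OpenAI Python client is below; it could also serve as the `call_llm` helper assumed in the pipeline sketch above. The model id matches the paper, but the dated snapshot may no longer be served by the API, and the prompt is a placeholder.

```python
# Deterministic (temperature 0.0) ChatGPT call via the OpenAI Python client.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-3.5-turbo-0301") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # remove the effect of random sampling, as in the paper
    )
    return response.choices[0].message.content

print(call_llm("Question: Who wrote 'Hamlet'? Answer briefly."))
```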
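
The Experiment Setup row also notes the tie-breaking rule: when candidates have equal plausibility (Eq. 6), the one generated earlier in Eq. 2 is kept. A tiny sketch of that selection rule, with hypothetical inputs:

```python
# Final candidate selection with the paper's tie-break: on equal scores,
# keep the candidate that was generated earlier (lower index).
from typing import Sequence

def select_candidate(candidates: Sequence[str], scores: Sequence[float]) -> str:
    best_idx, best_score = 0, scores[0]
    for i, s in enumerate(scores[1:], start=1):
        if s > best_score:  # strict '>' keeps the earlier candidate on ties
            best_idx, best_score = i, s
    return candidates[best_idx]

# Equal scores -> the first (earlier-generated) candidate is returned.
assert select_candidate(["Paris", "Lyon"], [1.5, 1.5]) == "Paris"
```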