RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Authors: Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning

ICLR 2024

Reproducibility Variable / Result / LLM Response
Research Type: Experimental. "Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy."
Researcher Affiliation: Academia. "Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning. Stanford University. psarthi@cs.stanford.edu"
Pseudocode: Yes. "We provide the pseudocode of both methods in Appendix F." (Algorithm 1: Tree Traversal Algorithm; Algorithm 2: Collapsed Tree Algorithm)
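The first of the two retrieval methods, tree traversal, can be illustrated with a short sketch. The code below is a hypothetical reconstruction of the idea behind Algorithm 1, not the authors' released implementation: the node dictionary shape, the `top_k` parameter, and the pure-Python cosine similarity are all assumptions made for illustration.

```python
# Hypothetical sketch of tree-traversal retrieval (Algorithm 1 in the paper).
# Node shape, top_k, and the similarity function are illustrative assumptions,
# not the authors' released code.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def tree_traversal(root_layer, query_emb, top_k=2, max_layers=3):
    """Starting from the root layer, keep the top_k nodes most similar
    to the query at each layer, then descend into their children."""
    selected, layer = [], root_layer
    for _ in range(max_layers):
        if not layer:
            break
        ranked = sorted(layer,
                        key=lambda n: cosine(n["embedding"], query_emb),
                        reverse=True)
        keep = ranked[:top_k]
        selected.extend(keep)
        # Next layer to search: the children of the nodes we kept.
        layer = [c for node in keep for c in node.get("children", [])]
    return [n["text"] for n in selected]

# Toy example: one root summary with two leaf children.
leaves = [
    {"text": "relevant leaf", "embedding": [1.0, 0.0]},
    {"text": "off-topic leaf", "embedding": [0.0, 1.0]},
]
root = {"text": "root summary", "embedding": [0.9, 0.1], "children": leaves}
print(tree_traversal([root], query_emb=[1.0, 0.0], top_k=1))
# → ['root summary', 'relevant leaf']
```

The retrieved context mixes summary nodes from upper layers with the leaf chunks beneath them, which is the point of the tree structure.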
Open Source Code: Yes. "We have released the code of RAPTOR at https://github.com/parthsarthi03/raptor."
Open Datasets: Yes. "The three evaluation datasets used in our experiments (QuALITY, QASPER, and NarrativeQA) are all publicly accessible. These datasets ensure that the retrieval and QA tests conducted in this study can be replicated."
Dataset Splits: No. The paper mentions using a "dev dataset" for QuALITY (Table 4) and testing on "20 stories from the QASPER dataset" (Figure 3), but does not provide explicit percentages or counts for the training, validation, and test splits of the datasets overall.
Hardware Specification: Yes. "To assess the computational efficiency and cost-effectiveness of RAPTOR's tree-building process, we conducted experiments on a consumer-grade laptop, specifically an Apple M1 Mac with 16GB of RAM."
Software Dependencies: No. Four language models are used in the RAPTOR experiments: GPT-3 and GPT-4 for QA tasks, GPT-3.5-turbo for summarization, and UnifiedQA, also for QA. The GPT models are accessed via API calls (OpenAI API); UnifiedQA is publicly available on Hugging Face. The paper names SBERT (multi-qa-mpnet-base-cos-v1) but does not provide a version number for the SBERT library or for other software such as UMAP.
Experiment Setup: Yes. "Specifically, we use the collapsed tree with 2000 maximum tokens, which approximately equates to retrieving the top-20 nodes. Using a token-based approach ensures the context does not exceed model context constraints, as token counts can vary across nodes. For experiments with the UnifiedQA model, we provide 400 tokens of context, as UnifiedQA has a max context length of 512 tokens." The nearest-neighbors parameter of UMAP, n_neighbors, is varied.
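The token-budgeted collapsed-tree retrieval described in this setup can be sketched as follows. This is an illustrative approximation, not the released code: the flat node list, the per-node `tokens` field, and the greedy budget cutoff are assumptions (real token counts would come from the model's tokenizer).

```python
# Hypothetical sketch of collapsed-tree retrieval under a token budget.
# The node fields and the greedy cutoff are illustrative assumptions.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def collapsed_tree_retrieve(nodes, query_emb, max_tokens=2000):
    """Rank every node (leaves and summaries in one flat pool) against
    the query, then add nodes in order of similarity until adding the
    next would exceed the token budget (e.g. 2000 for the GPT models,
    400 for UnifiedQA)."""
    ranked = sorted(nodes,
                    key=lambda n: cosine(n["embedding"], query_emb),
                    reverse=True)
    context, used = [], 0
    for node in ranked:
        if used + node["tokens"] > max_tokens:
            break  # budget reached; stop adding nodes
        context.append(node["text"])
        used += node["tokens"]
    return context

# Toy example with a deliberately small budget.
nodes = [
    {"text": "summary A", "embedding": [1.0, 0.0], "tokens": 120},
    {"text": "leaf B", "embedding": [0.8, 0.2], "tokens": 90},
    {"text": "leaf C", "embedding": [0.0, 1.0], "tokens": 100},
]
print(collapsed_tree_retrieve(nodes, [1.0, 0.0], max_tokens=220))
# → ['summary A', 'leaf B']
```

Capping by tokens rather than by node count is what keeps the assembled context inside the model's window even though summary nodes and leaf chunks differ in length.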