RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Authors: Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy. |
| Researcher Affiliation | Academia | Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning (Stanford University; psarthi@cs.stanford.edu) |
| Pseudocode | Yes | We provide the pseudocode of both methods in Appendix F: Algorithm 1 (Tree Traversal Algorithm) and Algorithm 2 (Collapsed Tree Algorithm). A hedged sketch of both procedures appears below the table. |
| Open Source Code | Yes | We have released the code of RAPTOR at https://github.com/parthsarthi03/raptor. |
| Open Datasets | Yes | The three evaluation datasets used in our experiments, QuALITY, QASPER, and NarrativeQA, are all publicly accessible. These datasets ensure that the retrieval and QA tests conducted in this study can be replicated. |
| Dataset Splits | No | The paper mentions using a 'dev dataset' for QuALITY (Table 4) and testing on '20 stories from the QASPER dataset' (Figure 3), but does not provide explicit percentages or counts for training, validation, and test splits for the overall datasets. |
| Hardware Specification | Yes | To assess the computational efficiency and cost-effectiveness of RAPTOR's tree-building process, we conducted experiments on a consumer-grade laptop, specifically an Apple M1 Mac with 16GB of RAM. |
| Software Dependencies | No | Four language models are used in the RAPTOR experiments: GPT-3 and GPT-4 for QA tasks, GPT-3.5-turbo for summarization, and UnifiedQA for QA tasks. The GPT models are accessed via the OpenAI API, and UnifiedQA is publicly available on Hugging Face. The paper names SBERT (multi-qa-mpnet-base-cos-v1) but does not provide a version number for the SBERT library or for other software such as UMAP (see the dependency sketch below the table). |
| Experiment Setup | Yes | Specifically, we use the collapsed tree with 2000 maximum tokens, which approximately equates to retrieving the top-20 nodes. Using a token-based approach ensures the context does not exceed model context constraints, as token counts can vary across nodes. For experiments with the UnifiedQA model, we provide 400 tokens of context, as UnifiedQA has a max context length of 512 tokens. The nearest-neighbors parameter, n_neighbors, in UMAP is varied. A sketch of the token-budgeted collapsed-tree retrieval appears below the table. |
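
To make the two retrieval procedures quoted above concrete, here is a minimal Python sketch of both. It is not the authors' released implementation (see https://github.com/parthsarthi03/raptor for that): the `Node` layout, the `cosine` helper, and the `top_k` default are assumptions chosen to mirror the paper's description of Algorithm 1 (tree traversal) and Algorithm 2 (collapsed tree), while the 2000-token budget comes from the paper's main setup.

```python
# Sketch of RAPTOR's two retrieval strategies, written against an assumed
# Node structure (not the released codebase's actual classes).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    text: str
    embedding: np.ndarray          # assumed precomputed with SBERT
    n_tokens: int                  # assumed precomputed with the QA model's tokenizer
    children: list["Node"] = field(default_factory=list)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tree_traversal_retrieve(query_emb: np.ndarray, root_layer: list[Node],
                            top_k: int = 5) -> list[Node]:
    """Algorithm 1 (sketch): keep the top-k most similar nodes in the current
    layer, then descend into their children and repeat down to the leaves."""
    selected, layer = [], root_layer
    while layer:
        best = sorted(layer, key=lambda n: cosine(query_emb, n.embedding),
                      reverse=True)[:top_k]
        selected.extend(best)
        layer = [child for node in best for child in node.children]
    return selected

def collapsed_tree_retrieve(query_emb: np.ndarray, all_nodes: list[Node],
                            max_tokens: int = 2000) -> list[Node]:
    """Algorithm 2 (sketch): pool every node from every layer into one set,
    rank by cosine similarity, and greedily fill a token budget (2000 tokens
    in the paper's main setup, roughly the top-20 nodes)."""
    ranked = sorted(all_nodes, key=lambda n: cosine(query_emb, n.embedding),
                    reverse=True)
    picked, used = [], 0
    for node in ranked:
        if used + node.n_tokens > max_tokens:
            break
        picked.append(node)
        used += node.n_tokens
    return picked
```

For the UnifiedQA runs described in the setup row, the same collapsed-tree function would simply be called with `max_tokens=400` to respect that model's 512-token context limit.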
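Because the paper names SBERT (multi-qa-mpnet-base-cos-v1) and UMAP but pins no library versions, the snippet below shows one plausible way to load them. The model string comes from the paper and the paper varies n_neighbors; the placeholder corpus, the specific n_neighbors value, the random seed, and the choice of the sentence-transformers and umap-learn packages are assumptions.

```python
# Assumed dependency usage: sentence-transformers for SBERT embeddings and
# umap-learn for dimensionality reduction, as described in the paper.
# Versions are not specified in the paper, so pin your own in practice.
from sentence_transformers import SentenceTransformer
import umap

chunks = [f"placeholder text chunk {i}" for i in range(50)]  # stand-in corpus

# SBERT encoder named in the paper.
embedder = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
embeddings = embedder.encode(chunks)

# UMAP reduction before clustering; n_neighbors=10 is an arbitrary choice
# here, since the paper reports varying this parameter.
reducer = umap.UMAP(n_neighbors=10, metric="cosine", random_state=42)
reduced = reducer.fit_transform(embeddings)
print(reduced.shape)  # (50, 2) with UMAP's default n_components=2
```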