HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

Authors: Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, Yu Su

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare HippoRAG with existing RAG methods on multi-hop question answering (QA) and show that our method outperforms the state-of-the-art methods remarkably, by up to 20%.
Researcher Affiliation | Academia | Bernal Jiménez Gutiérrez (The Ohio State University), Yiheng Shu (The Ohio State University), Yu Gu (The Ohio State University), Michihiro Yasunaga (Stanford University), Yu Su (The Ohio State University)
Pseudocode | No | The paper describes its methodology in prose and figures (e.g., Figure 2, Figure 4, Figure 5), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and data are available at https://github.com/OSU-NLP-Group/HippoRAG.
Open Datasets | Yes | We evaluate our method's retrieval capabilities primarily on two challenging multi-hop QA benchmarks, MuSiQue (answerable) [77] and 2WikiMultiHopQA [33].
Dataset Splits | Yes | To limit the experimental cost, we extract 1,000 questions from each validation set as done in previous work [63, 78]. In order to create a more realistic retrieval setting, we follow IRCoT [78] and collect all candidate passages (including supporting and distractor passages) from our selected questions and form a retrieval corpus for each dataset. The details of these datasets are shown in Table 1.
Hardware Specification | Yes | To run ColBERTv2 and Contriever for indexing and retrieval, we use 4 NVIDIA RTX A6000 GPUs with 48GB of memory. For indexing with Llama-3.1 models, we use 4 NVIDIA H100 GPUs with 80GB of memory. Finally, we used 2 AMD EPYC 7513 32-Core Processors to run the Personalized PageRank algorithm.
Software Dependencies | No | We use implementations based on PyTorch [59] and Hugging Face [86] for both Contriever [35] and ColBERTv2 [70]. We use the python-igraph [13] implementation of the PPR algorithm. For BM25, we employ Elasticsearch [24].
Experiment Setup | Yes | By default, we use GPT-3.5-turbo-1106 [55] with a temperature of 0 as our LLM L and Contriever [35] or ColBERTv2 [70] as our retriever M. We use 100 examples from MuSiQue's training data to tune HippoRAG's two hyperparameters: the synonymy threshold τ at 0.8 and the PPR damping factor at 0.5, which determines the probability that PPR will restart a random walk from the query nodes instead of continuing to explore the graph.
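The synonymy threshold τ = 0.8 can be illustrated with a minimal sketch: compare phrase embeddings by cosine similarity and connect any pair at or above the threshold. The embeddings and the helper names below are hypothetical, chosen only to show the thresholding step; the paper does not specify this exact code.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def synonym_edges(embeddings, tau=0.8):
    """Link every pair of phrase nodes whose embedding similarity
    reaches the synonymy threshold tau (0.8 in the paper)."""
    edges = []
    for (i, u), (j, v) in combinations(enumerate(embeddings), 2):
        if cosine(u, v) >= tau:
            edges.append((i, j))
    return edges

# Toy, hand-made 2-d "embeddings" (illustrative only): the first two
# vectors are near-parallel, so only that pair clears tau = 0.8.
emb = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0]]
print(synonym_edges(emb))  # [(0, 1)]
```

With τ at 0.8 only near-duplicates are linked; lowering τ would merge looser paraphrases into the same graph neighborhood, which is why the paper tunes it on held-out training examples.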