Larimar: Large Language Models with Episodic Memory Control

Authors: Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarathkrishna Swaminathan, Sihui Dai, Aurelie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Jiri Navratil, Soham Dan, Pin-Yu Chen

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to most competitive baselines, even in the challenging sequential editing setup, but also excels in speed, yielding speed-ups of 8-10x depending on the base LLM, as well as in flexibility, as the proposed architecture is simple, LLM-agnostic, and hence general.
Researcher Affiliation | Collaboration | ¹IBM AI Research, ²Princeton University (work done during an internship at IBM Research). Correspondence to: Payel Das <daspa@us.ibm.com> and Subhajit Chaudhury <subhajit@ibm.com>.
Pseudocode | Yes | Algorithm 1: Basic memory operations (Pham et al., 2021), reconstructed below; a runnable NumPy sketch follows the table.

Function write(Z):
    // Z: encoding of the episode to be written to memory (i.e., Z = e(X))
    Sample ξ ~ N(0, σ_ξ² I)
    Let Z_ξ = Z + ξ
    Compute addressing weight W_0 = Z_ξ M_0^†   // M_0 is a learned parameter representing the prior memory
    Compute posterior memory M = W_0^† Z_ξ
    return M

Function read(Z, M):
    // M: posterior memory from a previous write
    // Z: encoding of the read input (i.e., Z = e(X))
    Compute mean addressing weight W̄ = Z M^†
    Sample W ~ N(W̄, σ_W² I)   // σ_W is a learned parameter
    Compute output latent Z_read = W M
    return Z_read

Function generate(M):
    // M: posterior memory from a previous write
    Sample W ~ N(0, I)
    Compute output latent Z = W M
    return Z
Open Source Code | Yes | Our code is available at https://github.com/IBM/larimar.
Open Datasets | Yes | Our training data comprised 7.6 million examples constructed by splitting WikiText (Merity et al., 2016) texts into small chunks of 64 tokens. We compare the performance of Larimar against a number of recently proposed knowledge editing approaches on the CounterFact dataset (Meng et al., 2022a), designed to test language models' handling of counterfactual edits. We also evaluated Larimar on the ZsRE benchmark (Levy et al., 2017), a QA dataset for relation extraction through reading comprehension, with results displayed in Table 12. For this purpose, we curated facts from CNN Fast Facts (CNN, 2023) for 2021, 2022, and 2023. (A chunking sketch follows the table.)
Dataset Splits | Yes | Following other works (Meng et al., 2022a; Zheng et al., 2023), we used the first 2000 samples of this dataset and report the average over single-fact editing results for Larimar-1.3B and Larimar-6B in Table 2. We adapt Larimar to this experimental setup, wherein a subset of 200 facts with 5 rephrasings each is selected from the ZsRE validation dataset for testing.
Hardware Specification | Yes | For Larimar-6B's training, we used a setup with eight NVIDIA A100-80GB GPUs on a single node, utilizing bfloat16 precision and PyTorch Lightning with DeepSpeed ZeRO Stage 2 for efficient distributed training. (See the training-configuration sketch after the table.)
Software Dependencies | No | The paper mentions 'PyTorch Lightning with DeepSpeed ZeRO Stage 2' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We trained Larimar-6B models for 10 epochs using the Adam optimizer, a learning rate of 5e-6, and batch size 32. For Larimar-6B's training, we used a setup with eight NVIDIA A100-80GB GPUs on a single node, utilizing bfloat16 precision and PyTorch Lightning with DeepSpeed ZeRO Stage 2 for efficient distributed training. (See the training-configuration sketch after the table.)
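To make the reconstructed Algorithm 1 concrete, here is a minimal NumPy sketch of the pseudo-inverse memory operations. It is an illustration under assumptions, not Larimar's released implementation: the slot count, latent dimension, and noise scales sigma_xi and sigma_w (learned parameters in the paper) are placeholder values.

```python
import numpy as np

def write(Z, M0, sigma_xi=0.1, rng=None):
    """write(Z): distill an episode encoding Z (n x d) into a posterior memory.

    M0 (k x d) plays the role of the learned prior memory; sigma_xi stands in
    for the learned noise scale. Returns the posterior memory M (k x d).
    """
    rng = rng or np.random.default_rng(0)
    Z_xi = Z + sigma_xi * rng.standard_normal(Z.shape)  # Z_xi = Z + xi, xi ~ N(0, sigma_xi^2 I)
    W0 = Z_xi @ np.linalg.pinv(M0)                      # addressing weights: W_0 = Z_xi M_0^†
    return np.linalg.pinv(W0) @ Z_xi                    # posterior memory:   M = W_0^† Z_xi

def read(Z, M, sigma_w=0.1, rng=None):
    """read(Z, M): address the posterior memory with a query encoding Z."""
    rng = rng or np.random.default_rng(0)
    W_mean = Z @ np.linalg.pinv(M)                            # mean addressing weight: W̄ = Z M^†
    W = W_mean + sigma_w * rng.standard_normal(W_mean.shape)  # W ~ N(W̄, sigma_w^2 I)
    return W @ M                                              # Z_read = W M

def generate(M, n=1, rng=None):
    """generate(M): sample addressing weights from the prior and decode latents."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((n, M.shape[0]))  # W ~ N(0, I)
    return W @ M                              # Z = W M

# Usage: write a 3-sentence episode into a 4-slot, 16-dimensional memory.
rng = np.random.default_rng(0)
M0 = rng.standard_normal((4, 16))          # prior memory (a learned parameter in Larimar)
Z = rng.standard_normal((3, 16))           # episode encodings, Z = e(X)
M = write(Z, M0, rng=rng)
Z_read = read(Z, M, sigma_w=0.0, rng=rng)  # noiseless read approximately recovers Z
```

Note that a noiseless read returns Z M^† M, i.e., the projection of Z onto the span of the memory rows, which is why a freshly written episode is (approximately) recoverable.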
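The Open Datasets row states only that WikiText was split into 64-token chunks. Below is a plausible preprocessing sketch using Hugging Face datasets; the tokenizer choice (GPT-2 here) and the WikiText-103 config are assumptions, since the paper does not specify either.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: tokenizer and dataset config are illustrative; the paper only
# says WikiText texts were split into small chunks of 64 tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def chunk_examples(batch, chunk_len=64):
    # Tokenize the batch as one long stream, then slice it into
    # fixed-length chunks, dropping the trailing remainder.
    ids = tokenizer("\n".join(batch["text"]))["input_ids"]
    chunks = [ids[i:i + chunk_len]
              for i in range(0, len(ids) - chunk_len + 1, chunk_len)]
    return {"input_ids": chunks}

chunked = wiki.map(chunk_examples, batched=True,
                   remove_columns=wiki.column_names)
print(len(chunked))  # number of 64-token training chunks
```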
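The Hardware Specification and Experiment Setup rows together pin down the training configuration: 10 epochs, Adam, learning rate 5e-6, batch size 32, bfloat16, eight A100-80GB GPUs, and DeepSpeed ZeRO Stage 2 via PyTorch Lightning. A hedged sketch of how those settings map onto the Lightning 2.x API follows; LarimarModule is a hypothetical wrapper, and the released code may organize this differently.

```python
import lightning.pytorch as pl
import torch

class LarimarModule(pl.LightningModule):
    """Hypothetical LightningModule wrapping the Larimar encoder/memory/decoder."""

    def __init__(self, model, lr=5e-6):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        loss = self.model(**batch).loss  # assumes the wrapped model returns a loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Adam with learning rate 5e-6, as quoted in the Experiment Setup row.
        return torch.optim.Adam(self.parameters(), lr=self.lr)

trainer = pl.Trainer(
    max_epochs=10,                 # 10 training epochs
    accelerator="gpu",
    devices=8,                     # eight A100-80GB GPUs on a single node
    precision="bf16-mixed",        # bfloat16 precision
    strategy="deepspeed_stage_2",  # DeepSpeed ZeRO Stage 2 through Lightning
)
# trainer.fit(LarimarModule(model), train_dataloaders=loader)
# Batch size 32 would be set on the DataLoader feeding trainer.fit.
```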