Augmenting Language Models with Long-Term Memory
Authors: Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. |
| Researcher Affiliation | Collaboration | University of California, Santa Barbara Microsoft Research weizhiwang@ucsb.edu, {lidong1, chehao, xiaodl}@microsoft.com |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured code blocks. |
| Open Source Code | Yes | Our code is open-sourced at https://aka.ms/LongMem. |
| Open Datasets | Yes | We sample a subset of the Pile [GBB+20] as the training corpus, including BookCorpus2, Books3, OpenWebText2, Stack Exchange, Wikipedia, Gutenberg (PG-19), NIH ExPorter, and Pile-CC datasets. |
| Dataset Splits | Yes | We provide different validation splits of PG-22 based on length range, and the data statistics are presented in Table 1. |
| Hardware Specification | Yes | The pre-training and adaptation are trained on 16 32GB-Tesla-V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and 'faiss toolkit' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The training for memory-augmented adaptation iterates over 26B tokens, with a global batch size of 256 and a sequence length of 1024. The chunk size csz is 4 tokens and the memory size M is 65k key-value pairs of tokens. For each token, we retrieve K=64 attention key-value pairs for augmentation, i.e., K/csz=16 text chunks. The memory-augmentation layer is the 9th layer of the SideNet. The attention keys and values from the 18th layer of the backbone LLM are cached into memory and used for future retrieval. Other training details are presented in Appendix C. |
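
The memory configuration described in the Experiment Setup row (chunk size 4, a 65k-token key-value cache, and K=64 retrieved key-value pairs per query token, i.e., 16 chunks) can be illustrated with a minimal token-to-chunk retrieval sketch using the faiss toolkit mentioned under Software Dependencies. This is not the released LongMem implementation: the head dimension, the random placeholder cache, the mean-pooled chunk keys, and the flat inner-product index are assumptions made purely for illustration.

```python
# Minimal sketch of chunked key-value memory retrieval (illustrative only).
# See https://aka.ms/LongMem for the authors' released code.
import numpy as np
import faiss

D = 64                       # per-head key/value dimension (assumed for illustration)
CSZ = 4                      # chunk size in tokens (paper: csz = 4)
MEM_TOKENS = 65_536          # memory size M (paper: 65k cached key-value pairs)
K_TOKENS = 64                # retrieved key-value pairs per query token (paper: K = 64)
K_CHUNKS = K_TOKENS // CSZ   # = 16 retrieved text chunks

# Cached attention keys/values from the frozen backbone layer (random placeholders here).
cached_keys = np.random.randn(MEM_TOKENS, D).astype("float32")
cached_values = np.random.randn(MEM_TOKENS, D).astype("float32")

# Token-to-chunk retrieval: index one mean-pooled key per csz-token chunk (an assumption).
chunk_keys = cached_keys.reshape(-1, CSZ, D).mean(axis=1)
index = faiss.IndexFlatIP(D)   # exact inner-product search over chunk-level keys
index.add(chunk_keys)

def retrieve(query: np.ndarray):
    """Return the K_TOKENS cached key/value pairs for a single query vector."""
    _, chunk_ids = index.search(query.reshape(1, -1).astype("float32"), K_CHUNKS)
    # Expand each retrieved chunk id back into its csz token positions.
    token_ids = (chunk_ids[0][:, None] * CSZ + np.arange(CSZ)).reshape(-1)
    return cached_keys[token_ids], cached_values[token_ids]

q = np.random.randn(D)
keys, values = retrieve(q)
print(keys.shape, values.shape)  # (64, 64) (64, 64): K = 64 retrieved key-value pairs
```

In this sketch the retrieved keys and values would then be attended to by the memory-augmentation layer of the SideNet; how that fused attention is computed is described in the paper itself rather than reproduced here.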