Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MEMORYLLM: Towards Self-Updatable Large Language Models
Authors: Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, Julian McAuley
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations demonstrate the ability of MEMORYLLM to effectively incorporate new knowledge, as evidenced by its performance on model editing benchmarks. Meanwhile, the model exhibits long-term information retention capacity, which is validated through our custom-designed evaluations and long-context benchmarks. |
| Researcher Affiliation | Collaboration | 1UC San Diego, 2Amazon, 3UC Los Angeles. |
| Pseudocode | Yes | Algorithm 1 Training Strategy for Mitigating Forgetting Problems |
| Open Source Code | Yes | Our code and model are open-sourced at https://github.com/wangyu-ustc/MemoryLLM. |
| Open Datasets | Yes | We train our model on the processed version of the C4 dataset (Raffel et al., 2020) from RedPajama (Computer, 2023). |
| Dataset Splits | No | The paper uses the C4 dataset for training and existing benchmark datasets (zsRE, CounterFactual, LongBench, SQuAD, NaturalQA) with their respective evaluation methodologies. However, beyond describing the subsets used for evaluation tasks, it does not explicitly provide the specific training/validation/test splits (e.g., percentages or exact counts) used for the main model's development and evaluation. |
| Hardware Specification | Yes | The training is performed on 8 A100-80GB GPUs for three days. |
| Software Dependencies | No | The paper mentions using Llama2-7b but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | In our instantiation, N = 7,680 and K = 256. (Section 4.5.1) |