ResMem: Learn what you can and memorize the rest
Authors: Zitong Yang, Michal Lukasik, Vaishnavh Nagarajan, Zonglin Li, Ankit Rawat, Manzil Zaheer, Aditya K. Menon, Sanjiv Kumar
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across standard vision and natural language processing benchmarks. |
| Researcher Affiliation | Collaboration | Zitong Yang, Stanford University, Stanford, CA 94305, zitong@berkeley.edu; Michal Lukasik, Google Research, New York, NY 10011, mlukasik@google.com |
| Pseudocode | No | The paper describes the algorithm steps in numbered lists (Section 4.1) and illustrates it in Figure 1, but does not present it in a formal pseudocode or 'Algorithm' block (a hedged sketch of the described procedure follows this table). |
| Open Source Code | No | No explicit statement providing concrete access to source code for the methodology described in this paper was found. |
| Open Datasets | Yes | Empirically, we show that such explicit memorization indeed leads to generalization benefits: ResMem consistently improves the test accuracy of a baseline DeepNet on image classification tasks with CIFAR100 [33], and autoregressive language modeling on C4 [42] (Section 4). |
| Dataset Splits | Yes | For the language experiment... we created the query embeddings using the whole validation split and the same representation location. |
| Hardware Specification | No | The paper mentions 'CPU latency' but does not specify any particular CPU model. No other specific hardware details like GPU models, CPU types, or cloud instance specifications used for experiments are provided. |
| Software Dependencies | No | The paper mentions using 'Keras' for MobileNet-V2 models and 'ScaNN' for nearest neighbor search, but no specific version numbers for these or other software dependencies are provided. |
| Experiment Setup | Yes | For all six DeepNet trainings, we use SGD with batch size 128, trained for 256 epochs. We use a peak learning rate 0.4, and momentum 0.9. We warm up the learning rate linearly for the first 15 epochs, and decay the learning rate by 0.1 after epochs {96, 192, 224}. For ResMem, we use ... σ = 0.7, k = 53, and T = 1.4... We pre-trained the DeepNet ... for 1,000,000 steps, with dropout rate of 0.1 and batch size of 128. The learning rate for the first 10,000 steps is fixed to 0.01... (the learning-rate schedule is sketched after this table) |
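
The paper gives the ResMem procedure as numbered prose steps (Section 4.1, Figure 1) rather than pseudocode: train a base DeepNet, compute its residuals on the training set, memorize those residuals with a soft k-nearest-neighbor regressor over the DeepNet's embeddings, and add the retrieved residual to the base prediction at test time. The sketch below is our minimal reading of that procedure, not the authors' code; the function names are ours, and the exact placement of the temperature T and the kernel bandwidth σ in the residual and weighting formulas is an assumption that should be checked against the paper.

```python
# Hedged sketch of the ResMem combination step; illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fit_residual_memory(train_embeddings, train_labels_onehot, train_logits,
                        temperature=1.4, k=53):
    """Memorize training residuals: one-hot label minus tempered base prediction."""
    residuals = train_labels_onehot - softmax(train_logits / temperature)
    index = NearestNeighbors(n_neighbors=k).fit(train_embeddings)
    return index, residuals


def resmem_predict(test_logits, test_embeddings, index, residuals, sigma=0.7,
                   temperature=1.4):
    """Base prediction plus a kernel-weighted average of neighbors' residuals."""
    dists, idx = index.kneighbors(test_embeddings)      # both of shape (n, k)
    weights = np.exp(-(dists ** 2) / sigma)             # soft k-NN weights (assumed form)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = (weights[:, :, None] * residuals[idx]).sum(axis=1)
    return softmax(test_logits / temperature) + retrieved
```

The hyperparameter defaults (σ = 0.7, k = 53, T = 1.4) are the values quoted in the 'Experiment Setup' row; the paper uses ScaNN for the nearest-neighbor search, whereas this sketch uses a brute-force index for simplicity.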
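
The CIFAR-100 optimizer settings quoted above (peak learning rate 0.4, linear warmup for 15 epochs, 0.1x decay after epochs 96, 192, and 224) translate into a simple schedule. The sketch below is an illustrative reading of that description (the function name is ours; the paper provides no code), assuming the decay factor is applied cumulatively at each listed epoch.

```python
def cifar_learning_rate(epoch, peak_lr=0.4, warmup_epochs=15,
                        decay_epochs=(96, 192, 224), decay_factor=0.1):
    """Linear warmup to the peak rate, then step decay at the listed epochs."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    lr = peak_lr
    for boundary in decay_epochs:
        if epoch >= boundary:
            lr *= decay_factor
    return lr
```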