MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
Authors: Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Liao Heng, Yunhe Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model. |
| Researcher Affiliation | Collaboration | 1 State Key Lab of General AI, School of Intelligence Science and Technology, Peking University; 2 Huawei Noah's Ark Lab; 3 Huawei HiSilicon |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code and data will not be provided. |
| Open Datasets | Yes | As for training data, the Pile [12] dataset contains an 825 GiB corpus... We choose six widely-used evaluation tasks for our approach: PIQA [4], WinoGrande [23], WSC [24], ARC-E, ARC-C [9], and LogiQA [19]. |
| Dataset Splits | No | The paper mentions training on a 'validation set' (Tab. 4) but does not provide specific details on the split percentages or sample counts for this set. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Pythia' as an LLM training framework, but does not provide specific version numbers for software components like PyTorch, CUDA, or other libraries. |
| Experiment Setup | Yes | We use exactly the same optimizer, scheduler and other hyper-parameters following the setting of Pythia to conduct fair comparisons. As for the hyper-parameters of the Memory Layer, we fix the value of τ to be 8, while the number of hash tables K is 64, 96 and 128 respectively for the MemoryFormer-tiny, -small and -base models. Notably, considering the sparsity of the gradients of the hash tables, we set the learning rate to 3 times the baseline learning rate used by the corresponding Pythia model. |
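
The quoted setup can be summarized as a small configuration sketch. The code below is a hypothetical illustration (the paper does not release code): only τ = 8, K ∈ {64, 96, 128} and the 3× learning-rate multiplier come from the quoted text, while the baseline Pythia learning rates are left as caller-supplied placeholders.

```python
from dataclasses import dataclass


@dataclass
class MemoryFormerConfig:
    """Hypothetical container for the Memory Layer hyper-parameters quoted above."""
    name: str
    tau: int              # Memory Layer hyper-parameter tau, fixed to 8 for all model sizes
    num_hash_tables: int  # K: number of hash tables per Memory Layer
    learning_rate: float  # 3x the baseline learning rate of the matching Pythia model


def build_configs(pythia_base_lr: dict[str, float]) -> list[MemoryFormerConfig]:
    """pythia_base_lr maps a model size ("tiny"/"small"/"base") to the baseline
    Pythia learning rate; those values are not listed in the excerpt, so the
    caller must supply them."""
    hash_tables = {"tiny": 64, "small": 96, "base": 128}
    return [
        MemoryFormerConfig(
            name=f"MemoryFormer-{size}",
            tau=8,
            num_hash_tables=k,
            learning_rate=3.0 * pythia_base_lr[size],
        )
        for size, k in hash_tables.items()
    ]
```

All other optimizer and scheduler settings are stated to be inherited unchanged from the corresponding Pythia recipe, so only the Memory Layer-specific values are captured in this sketch.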