MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

Authors: Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
Researcher Affiliation | Collaboration | 1 State Key Lab of General AI, School of Intelligence Science and Technology, Peking University; 2 Huawei Noah's Ark Lab; 3 Huawei HiSilicon
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The code and data will not be provided.
Open Datasets | Yes | As for training data, the Pile [12] dataset contains an 825 GiB corpus... We choose six widely-used evaluation tasks for our approach: PIQA [4], WinoGrande [23], WSC [24], ARC-E, ARC-C [9], and LogiQA [19]. (A benchmark-evaluation sketch is given below the table.)
Dataset Splits | No | The paper refers to a 'validation set' (Tab. 4) but does not provide split percentages or sample counts for this set.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using 'Pythia' as an LLM training framework, but does not provide specific version numbers for software components like PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | We use exactly the same optimizer, scheduler, and other hyper-parameters following the setting of Pythia to conduct fair comparisons. As for the hyper-parameters of the Memory Layer, we fix the value of τ to be 8, while the number of hash tables K is 64, 96, and 128 respectively for the MemoryFormer-tiny, -small, and -base models. Notably, considering the sparsity of the gradients of the hash tables, we set the learning rate to be 3 times the baseline learning rate used by the corresponding Pythia model. (A configuration sketch of these hyper-parameters follows the table.)
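
The hyper-parameters quoted in the Experiment Setup row can be collected in a small configuration sketch. This is a minimal illustration, not the authors' code: the MemoryLayerConfig dataclass, its field names, and the hash_table_lr helper are assumptions introduced here; only the numeric values (τ = 8, K = 64/96/128, and the 3× learning-rate scaling for the hash tables) come from the paper.

```python
# Minimal sketch (not the authors' implementation) collecting the Memory Layer
# hyper-parameters reported in the paper. Class and field names are illustrative.
from dataclasses import dataclass

@dataclass
class MemoryLayerConfig:
    tau: int                          # Memory Layer hyper-parameter tau, fixed to 8 for all model sizes
    num_hash_tables: int              # K, the number of hash tables per Memory Layer
    hash_table_lr_scale: float = 3.0  # hash-table LR = 3x the corresponding Pythia baseline LR

CONFIGS = {
    "MemoryFormer-tiny":  MemoryLayerConfig(tau=8, num_hash_tables=64),
    "MemoryFormer-small": MemoryLayerConfig(tau=8, num_hash_tables=96),
    "MemoryFormer-base":  MemoryLayerConfig(tau=8, num_hash_tables=128),
}

def hash_table_lr(pythia_baseline_lr: float, cfg: MemoryLayerConfig) -> float:
    """Learning rate for the sparse hash-table parameters; all other parameters
    keep the optimizer, scheduler, and learning rate of the matching Pythia model."""
    return cfg.hash_table_lr_scale * pythia_baseline_lr
```

In this reading, the scaled rate would apply only to the hash-table parameter group, while the remaining parameters reuse the Pythia optimizer settings, matching the fair-comparison setup described in the quote.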
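
For the six benchmarks listed in the Open Datasets row, the sketch below shows how a trained checkpoint might be scored with EleutherAI's lm-evaluation-harness, the evaluation tooling used by the Pythia project. The checkpoint path is hypothetical, the paper releases no code, and exact task names can differ across harness versions, so treat this only as an illustration.

```python
# Hypothetical evaluation sketch using EleutherAI's lm-evaluation-harness
# (pip install lm-eval). Checkpoint path and task names are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                         # HuggingFace causal-LM backend
    model_args="pretrained=path/to/memoryformer-base",  # hypothetical local checkpoint
    tasks=[
        "piqa",           # PIQA
        "winogrande",     # WinoGrande
        "wsc273",         # WSC (name may vary across harness versions)
        "arc_easy",       # ARC-E
        "arc_challenge",  # ARC-C
        "logiqa",         # LogiQA
    ],
    batch_size=8,
)
print(results["results"])
```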