MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
Authors: Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Liao Heng, Yunhe Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model. |
| Researcher Affiliation | Collaboration | 1 State Key Lab of General AI, School of Intelligence Science and Technology, Peking University; 2 Huawei Noah's Ark Lab; 3 Huawei HiSilicon |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code and data will not be provided. |
| Open Datasets | Yes | As for training data, the Pile [12] dataset contains an 825 GiB corpus... We choose six widely-used evaluation tasks for our approach: PIQA [4], WinoGrande [23], WSC [24], ARC-E, ARC-C [9], and LogiQA [19]. |
| Dataset Splits | No | The paper mentions training on a 'validation set' (Tab. 4) but does not provide specific details on the split percentages or sample counts for this set. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Pythia' as an LLM training framework, but does not provide specific version numbers for software components like PyTorch, CUDA, or other libraries. |
| Experiment Setup | Yes | We use exactly the same optimizer, scheduler and other hyper-parameters following the setting of Pythia to conduct fair comparisons. As for the hyper-parameters of the Memory Layer, we fix the value of τ to be 8, while the number of hash tables K is 64, 96 and 128 respectively for the MemoryFormer-tiny, -small and -base models. Notably, considering the sparsity of the gradients of the hash tables, we set the learning rate to 3 times the baseline learning rate used by the corresponding Pythia model. |
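
The quoted setup can be summarized as a small configuration sketch. The code below is a hypothetical illustration (the paper does not release code): only τ = 8, K ∈ {64, 96, 128} and the 3× learning-rate multiplier come from the quoted text, while the baseline Pythia learning rates are left as caller-supplied placeholders.

```python
from dataclasses import dataclass


@dataclass
class MemoryFormerConfig:
    """Hypothetical container for the Memory Layer hyper-parameters quoted above."""
    name: str
    tau: int              # Memory Layer hyper-parameter tau, fixed to 8 for all model sizes
    num_hash_tables: int  # K: number of hash tables per Memory Layer
    learning_rate: float  # 3x the baseline learning rate of the matching Pythia model


def build_configs(pythia_base_lr: dict[str, float]) -> list[MemoryFormerConfig]:
    """pythia_base_lr maps a model size ("tiny"/"small"/"base") to the baseline
    Pythia learning rate; those values are not listed in the excerpt, so the
    caller must supply them."""
    hash_tables = {"tiny": 64, "small": 96, "base": 128}
    return [
        MemoryFormerConfig(
            name=f"MemoryFormer-{size}",
            tau=8,
            num_hash_tables=k,
            learning_rate=3.0 * pythia_base_lr[size],
        )
        for size, k in hash_tables.items()
    ]
```

All other optimizer and scheduler settings are stated to be inherited unchanged from the corresponding Pythia recipe, so only the Memory Layer-specific values are captured in this sketch.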