Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
Authors: Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate D2DMoE across benchmarks in text classification, image classification, and language modeling, demonstrating significant improvements in cost-performance trade-offs in all cases. |
| Researcher Affiliation | Academia | Filip Szatkowski (IDEAS NCBR, Warsaw University of Technology); Bartosz Wójcik (IDEAS NCBR, Jagiellonian University); Mikołaj Piórczyński (Warsaw University of Technology); Simone Scardapane (Sapienza University of Rome) |
| Pseudocode | Yes | Listing 1: Simplified pseudocode of our efficient D2DMoE implementation for GPUs (an illustrative dynamic-k layer sketch follows this table) |
| Open Source Code | Yes | The source code for our experiments is available at: https://github.com/bartwojcik/D2DMoE. |
| Open Datasets | Yes | We evaluate it on the popular ImageNet-1k [35] dataset. We use a pre-trained ViT-B checkpoint as the base model and compare D2DMoE with MoEfication in terms of the computational cost versus accuracy trade-off. [...] We evaluate our method with BERT-base [6] on the CARER dataset [36] that contains text samples categorized into 6 different emotion categories. [...] We evaluate our method on language modeling and compare it with MoEfication using GPT-2 base [31] and Gemma-2B [42]. We initialize GPT-2 models from a publicly available OpenAI checkpoint pre-trained on a closed-source WebText dataset and use OpenWebText [12] in all of our experiments. For Gemma-2B, we also start from the publicly available pretrained model and evaluate its language capabilities on the C4 dataset [32] after finetuning. |
| Dataset Splits | Yes | For D2DMoE, we replace the MHA projections and train the replacements for 3 epochs with the initial learning rate 0.001 and batch size 128, and then finetune the model for 90 epochs with sparsity enforcement weight α = 0.2, initial learning rate 2×10⁻⁵ and batch size 512. [...] For MoEfication, we first convert the pre-trained model to a ReLU-based one and finetune for 90 epochs with an initial learning rate of 0.0001 and batch size 256. [...] We finetuned base dense models on CARER dataset for 5 epochs with 2×10⁻⁵ learning rate. [...] For Gemma-2B, we also start from the publicly available pretrained model and evaluate its language capabilities on the C4 dataset [32] after finetuning. |
| Hardware Specification | Yes | We perform the experiments on an NVIDIA A100 GPU. [...] Table 2: Wall-clock time measurements (µs) of execution of our D2DMoE layer when using different data types and GPUs. GPU RTX 4090 float32... A100 float32... |
| Software Dependencies | No | The paper states: 'All experiments were performed using the PyTorch library [29] on the NVIDIA A100 and V100 GPUs on internal clusters. We utilize the fvcore library to count model FLOPs.' While PyTorch and fvcore are mentioned, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For D2DMoE, we replace the MHA projections and train the replacements for 3 epochs with the initial learning rate 0.001 and batch size 128, and then finetune the model for 90 epochs with sparsity enforcement weight α = 0.2, initial learning rate 2×10⁻⁵ and batch size 512. [...] We finetuned base dense models on CARER dataset for 5 epochs with 2×10⁻⁵ learning rate. For sparsity enforcement in D2DMoE we use α linearly increasing from zero to 0.0001 over training. For both MoEfication and D2DMoE we train routers with batch size 64 and initial learning rate 0.001 for 5 epochs. (A hedged sketch of such a sparsity-enforcement term also appears after this table.) |
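For readers without access to Listing 1, the following is a minimal, illustrative sketch of a dynamic-k MoE layer in PyTorch. It is not the authors' implementation: the expert construction, the router training objective, the selection rule, and names such as `tau` and `expert_dim` are assumptions made here for illustration; the paper's Listing 1 and the released repository are authoritative.

```python
import torch
import torch.nn as nn


class DynamicKMoELayer(nn.Module):
    """Illustrative dynamic-k MoE FFN layer (not the paper's Listing 1).

    Each expert is a small slice of the original dense FFN; a linear router
    predicts each expert's contribution, and only experts whose prediction
    exceeds a fraction `tau` of the per-token maximum are executed.
    """

    def __init__(self, d_model: int, expert_dim: int, num_experts: int, tau: float = 0.1):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.ReLU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        )
        # Router regresses a non-negative contribution score per expert.
        self.router = nn.Linear(d_model, num_experts)
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = torch.relu(self.router(x))                       # (num_tokens, num_experts)
        threshold = self.tau * scores.max(dim=-1, keepdim=True).values
        mask = scores > threshold                                 # variable expert count per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            selected = mask[:, e]                                 # tokens routed to expert e
            if selected.any():
                out[selected] = out[selected] + expert(x[selected])
        return out
```

The sequential per-expert loop is only for clarity; an efficient GPU implementation would instead gather tokens per expert and use batched matrix multiplications, which is the point of the paper's Listing 1.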
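The "sparsity enforcement weight α" quoted in the Experiment Setup row suggests a regularization term on FFN activations added to the task loss. Below is a minimal sketch under that assumption, using an L1-style penalty and the linearly increasing α schedule quoted for the CARER setup; the exact form of the paper's regularizer may differ, and the function names here are hypothetical.

```python
import torch


def sparsity_penalty(ffn_activations: list[torch.Tensor]) -> torch.Tensor:
    # Mean absolute activation across all FFN hidden layers; an L1-style
    # penalty is one plausible way to push activations toward zero.
    return torch.stack([a.abs().mean() for a in ffn_activations]).mean()


def training_loss(task_loss: torch.Tensor,
                  ffn_activations: list[torch.Tensor],
                  step: int, total_steps: int,
                  alpha_max: float = 1e-4) -> torch.Tensor:
    # "α linearly increasing from zero to 0.0001 over training" (CARER setup);
    # a fixed α = 0.2 is quoted for the ViT-B / ImageNet-1k setup.
    alpha = alpha_max * step / total_steps
    return task_loss + alpha * sparsity_penalty(ffn_activations)
```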