Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
Authors: Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate D2DMoE across benchmarks in text classification, image classification, and language modeling, demonstrating significant improvements in cost-performance trade-offs in all cases. |
| Researcher Affiliation | Academia | Filip Szatkowski (IDEAS NCBR, Warsaw University of Technology); Bartosz Wójcik (IDEAS NCBR, Jagiellonian University); Mikołaj Piórczyński (Warsaw University of Technology); Simone Scardapane (Sapienza University of Rome) |
| Pseudocode | Yes | Listing 1: Simplified pseudocode of our efficient D2DMoE implementation for GPUs (an illustrative dynamic-k layer sketch follows this table) |
| Open Source Code | Yes | The source code for our experiments is available at: https://github.com/bartwojcik/D2DMoE. |
| Open Datasets | Yes | We evaluate it on the popular ImageNet-1k [35] dataset. We use a pre-trained ViT-B checkpoint as the base model and compare D2DMoE with MoEfication in terms of the computational cost versus accuracy trade-off. [...] We evaluate our method with BERT-base [6] on the CARER dataset [36] that contains text samples categorized into 6 different emotion categories. [...] We evaluate our method on language modeling and compare it with MoEfication using GPT-2 base [31] and Gemma-2B [42]. We initialize GPT-2 models from a publicly available OpenAI checkpoint pre-trained on a closed-source WebText dataset and use OpenWebText [12] in all of our experiments. For Gemma-2B, we also start from the publicly available pretrained model and evaluate its language capabilities on the C4 dataset [32] after finetuning. |
| Dataset Splits | Yes | For D2DMoE, we replace the MHA projections and train the replacements for 3 epochs with the initial learning rate 0.001 and batch size 128, and then finetune the model for 90 epochs with sparsity enforcement weight α = 0.2, initial learning rate 2×10⁻⁵ and batch size 512. [...] For MoEfication, we first convert the pre-trained model to a ReLU-based one and finetune for 90 epochs with an initial learning rate of 0.0001 and batch size 256. [...] We finetuned base dense models on CARER dataset for 5 epochs with 2×10⁻⁵ learning rate. [...] For Gemma-2B, we also start from the publicly available pretrained model and evaluate its language capabilities on the C4 dataset [32] after finetuning. |
| Hardware Specification | Yes | We perform the experiments on an NVIDIA A100 GPU. [...] Table 2: Wall-clock time measurements (µs) of execution of our D2DMoE layer when using different data types and GPUs. GPU RTX 4090 float32... A100 float32... |
| Software Dependencies | No | The paper states: 'All experiments were performed using the PyTorch library [29] on the NVIDIA A100 and V100 GPUs on internal clusters. We utilize the fvcore library to count model FLOPs.' While PyTorch and fvcore are mentioned, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For D2DMoE, we replace the MHA projections and train the replacements for 3 epochs with the initial learning rate 0.001 and batch size 128, and then finetune the model for 90 epochs with sparsity enforcement weight α = 0.2, initial learning rate 2×10⁻⁵ and batch size 512. [...] We finetuned base dense models on CARER dataset for 5 epochs with 2×10⁻⁵ learning rate. For sparsity enforcement in D2DMoE we use α linearly increasing from zero to 0.0001 over training. For both MoEfication and D2DMoE we train routers with batch size 64 and initial learning rate 0.001 for 5 epochs. (A hedged sketch of such a sparsity-enforcement term also appears after this table.) |
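For readers without access to Listing 1, the following is a minimal, illustrative sketch of a dynamic-k MoE layer in PyTorch. It is not the authors' implementation: the expert construction, the router training objective, the selection rule, and names such as `tau` and `expert_dim` are assumptions made here for illustration; the paper's Listing 1 and the released repository are authoritative.

```python
import torch
import torch.nn as nn


class DynamicKMoELayer(nn.Module):
    """Illustrative dynamic-k MoE FFN layer (not the paper's Listing 1).

    Each expert is a small slice of the original dense FFN; a linear router
    predicts each expert's contribution, and only experts whose prediction
    exceeds a fraction `tau` of the per-token maximum are executed.
    """

    def __init__(self, d_model: int, expert_dim: int, num_experts: int, tau: float = 0.1):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.ReLU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        )
        # Router regresses a non-negative contribution score per expert.
        self.router = nn.Linear(d_model, num_experts)
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = torch.relu(self.router(x))                       # (num_tokens, num_experts)
        threshold = self.tau * scores.max(dim=-1, keepdim=True).values
        mask = scores > threshold                                 # variable expert count per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            selected = mask[:, e]                                 # tokens routed to expert e
            if selected.any():
                out[selected] = out[selected] + expert(x[selected])
        return out
```

The sequential per-expert loop is only for clarity; an efficient GPU implementation would instead gather tokens per expert and use batched matrix multiplications, which is the point of the paper's Listing 1.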
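The "sparsity enforcement weight α" quoted in the Experiment Setup row suggests a regularization term on FFN activations added to the task loss. Below is a minimal sketch under that assumption, using an L1-style penalty and the linearly increasing α schedule quoted for the CARER setup; the exact form of the paper's regularizer may differ, and the function names here are hypothetical.

```python
import torch


def sparsity_penalty(ffn_activations: list[torch.Tensor]) -> torch.Tensor:
    # Mean absolute activation across all FFN hidden layers; an L1-style
    # penalty is one plausible way to push activations toward zero.
    return torch.stack([a.abs().mean() for a in ffn_activations]).mean()


def training_loss(task_loss: torch.Tensor,
                  ffn_activations: list[torch.Tensor],
                  step: int, total_steps: int,
                  alpha_max: float = 1e-4) -> torch.Tensor:
    # "α linearly increasing from zero to 0.0001 over training" (CARER setup);
    # a fixed α = 0.2 is quoted for the ViT-B / ImageNet-1k setup.
    alpha = alpha_max * step / total_steps
    return task_loss + alpha * sparsity_penalty(ffn_activations)
```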