LookupFFN: Making Transformers Compute-lite for CPU inference

Authors: Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh

ICML 2023

Reproducibility assessment: each entry below gives the variable, the result, and the supporting LLM response.
Research Type: Experimental. "For RoBERTa language model pretraining, our formulation achieves similar performance compared to GEMM based FFNs, while dramatically reducing the required FLOP. Our development is complemented with a detailed hardware profiling of strategies that will maximize efficiency not just on contemporary hardware but on products that will be offered in the near/medium term future. Code is available at https://github.com/mlpen/LookupFFN." and "In this section, we will present our empirical results evaluating the benefits/limitations of replacing the Vanilla FFN with a LookupFFN in a Transformer, and conduct a detailed performance profiling of LookupFFN."
Researcher Affiliation: Collaboration. "Zhanpeng Zeng (1), Michael Davies (1), Pranav Pulijala (1), Karthikeyan Sankaralingam (1, 2), Vikas Singh (1)" and "(1) University of Wisconsin, Madison, USA; (2) NVIDIA Research."
Pseudocode: No. The paper describes the operations (Hash and Gather) and illustrates them with Figure 3, but does not provide formal pseudocode or an algorithm block.
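Since no algorithm block is given, here is a minimal PyTorch sketch of the Hash-then-Gather pattern referred to above. The module name, the hard sign hashing, and all hyperparameters are illustrative assumptions; the paper's actual method uses a differentiable hash built on a fast Hadamard transform, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class HashGatherFFN(nn.Module):
    """Hypothetical sketch: replace a dense FFN with hash-indexed table lookups."""

    def __init__(self, d_model: int, num_tables: int = 8, bits: int = 8):
        super().__init__()
        self.num_tables, self.bits = num_tables, bits
        # One projection per table acts as the hashing function (illustrative choice).
        self.proj = nn.Parameter(torch.randn(num_tables, d_model, bits) / d_model ** 0.5)
        # Learned lookup tables: 2**bits slots per table, each holding a d_model vector.
        self.tables = nn.Parameter(0.02 * torch.randn(num_tables, 2 ** bits, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        # Hash: project, take the sign of each bit, and pack the bits into a slot index.
        scores = torch.einsum("bd,tdk->btk", x, self.proj)        # (batch, tables, bits)
        hard_bits = (scores > 0).long()
        powers = 2 ** torch.arange(self.bits, device=x.device)
        slot = (hard_bits * powers).sum(dim=-1)                   # (batch, tables)
        # Gather: fetch one learned row per table and sum them, avoiding a GEMM.
        table_id = torch.arange(self.num_tables, device=x.device)
        gathered = self.tables[table_id, slot]                    # (batch, tables, d_model)
        return gathered.sum(dim=1)
```

The hard sign above only illustrates the inference-time lookup structure; training the hash end to end requires the paper's differentiable formulation.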
Open Source Code: Yes. "Code is available at https://github.com/mlpen/LookupFFN."
Open Datasets: Yes. "We use RoBERTa language modeling pretraining (Liu et al., 2019a) as our evaluation tool to measure the method performance, since it is a challenging task. The models are pretrained using masked language modeling (Devlin et al., 2019) on the English Wikipedia corpus."
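A hypothetical loading sketch for an English Wikipedia dump follows; the paper does not state that it used the HuggingFace `datasets` library or this particular dump, so both are assumptions for illustration only.

```python
from datasets import load_dataset

# Load a preprocessed English Wikipedia dump (illustrative choice of snapshot).
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Each record holds one raw article, which would then be tokenized and masked
# for masked-language-model pretraining.
print(wiki[0]["text"][:200])
```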
Dataset Splits: Yes. "We pretrain each model for 250K steps with a batch size of 256, where each sequence is of length 512." (The paper evaluates using "Log Perplexity", a metric commonly computed on a validation/test set during or after training, which implies such a split. While the paper does not state split percentages explicitly, a held-out evaluation set is inherent to the RoBERTa pretraining task and to perplexity evaluation.)
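For reference, the reported "Log Perplexity" is the average negative log-likelihood over held-out masked tokens; a minimal PyTorch computation, with variable names that are assumptions rather than the authors' code, is:

```python
import torch.nn.functional as F

def masked_lm_log_perplexity(logits, labels, ignore_index=-100):
    # Mean cross-entropy over masked positions only (labels equal ignore_index
    # elsewhere); this average negative log-likelihood is the log perplexity.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=ignore_index
    )
```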
Hardware Specification: Yes. "For CPU, we implemented these kernels in C++ using OpenMP for inference, which uses AVX2 vector instructions." and "Tab. 5 shows the average per-iteration time for vanilla, Slide-, and Mongoose-based FFN, which is sized to match typical hyperparameters for a standard Transformer model, on a modern AMD EPYC 7452 (Zen 2) 32-core server."
Software Dependencies: No. The paper mentions "PyTorch (Paszke et al., 2019)" but does not specify a version number for it or for other software dependencies such as CUDA or OpenMP. It says: "We used PyTorch (Paszke et al., 2019) for the majority of the implementation. On the GPU, our fast Hadamard Transform and weighted gather operators are not supported by PyTorch, so we implemented custom CUDA kernels to support the operators for training. For CPU, we implemented these kernels in C++ using OpenMP for inference, which uses AVX2 vector instructions."
Experiment Setup: Yes. "We pretrain each model for 250K steps with a batch size of 256, where each sequence is of length 512. We use an Adam optimizer with 1e-4 learning rate, 10,000 warm-up steps, and linear decay."
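A PyTorch sketch of this reported setup is below; the placeholder `model` and the exact shape of the linear warmup/decay schedule are assumptions rather than the authors' training script.

```python
import torch

TOTAL_STEPS, WARMUP_STEPS = 250_000, 10_000

# Placeholder module; the paper's model is a RoBERTa-style Transformer
# whose FFN blocks are replaced by LookupFFN.
model = torch.nn.Linear(768, 768)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def linear_warmup_then_decay(step: int) -> float:
    # Ramp the learning rate linearly over the first 10k steps, then decay it
    # linearly to zero at step 250k, matching the reported schedule.
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)
```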