SS1: Accelerating Inference with Fast and Expressive Sketch Structured Transform

Authors: Aditya Desai, Kimia Saedi, Apoorv Walia, Jihyeong Lee, Keren Zhou, Anshumali Shrivastava

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We confirm empirically that SS1 offers better quality-efficiency tradeoffs than competing variants. We evaluate SS1 layers on a broad set of settings and use cases. A summary of our findings is below.
Researcher Affiliation | Collaboration | Kimia Saedi (1), Aditya Desai (1), Apoorv Walia (1), Jihyeong Lee (2), Keren Zhou (2), Anshumali Shrivastava (1, 3); affiliations: (1) Rice University, (2) George Mason University, (3) Ken Kennedy Institute, Third AI Corp., Xmad.ai
Pseudocode | Yes | Algorithm 1 SS1(Z, X). (Illustrative parameter-sharing sketch after the table.)
Open Source Code | Yes | Our code is open-source.
Open Datasets | Yes | We use WikiText-103 [29] to train, evaluate, and test the model. We train MLPMixer on the CIFAR (C10, C100) [30] and Tiny-ImageNet [31] datasets. We then finetune these models on the GLUE (General Language Understanding Evaluation) benchmark [18]. (Dataset-loading sketch after the table.)
Dataset Splits | No | The paper mentions evaluating on the 'test dataset' but does not explicitly provide details about training/validation/test splits, such as percentages or sample counts for each split.
Hardware Specification | Yes | MLPMixer experiments: each run takes 2 hours on a single Quadro RTX 8000 machine. GPT2-Small experiments: each run takes around 13 hours on four 32 GB V100 GPUs. BERT experiments: QQP, the largest dataset for finetuning with BERT-Large, takes around 7-8 hours on a Quadro RTX 8000. Llama experiments: evaluation on MMLU takes 1 hour on one 40 GB A100.
Software Dependencies | No | The paper mentions using PyTorch, the Hugging Face Transformers library, and Triton but does not provide specific version numbers for these software components. (Version-capture snippet after the table.)
Experiment Setup | Yes | For training we use the AdamW optimizer with α = 6e-4, β1 = 0.9, β2 = 0.999, ε = 1e-08, and weight_decay = 0.1. We employ a linear schedule and warm up for 1% of steps. The effective batch size is 512, which is not achievable within our hardware memory; thus, we perform gradient accumulation every 32 steps to reach it. All models are trained for 100 epochs. The hyperparameters are adopted from the Monarch paper [16]. We use a fixed block size of Block_Size_K = 32, Block_Size_N = 32, and Block_Size_M = 64 for both the forward and backward kernels across all layers during training and testing. (Training-setup sketch after the table.)
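
As context for the pseudocode row above: Algorithm 1 in the paper defines SS1(Z, X). The snippet below is a minimal, hypothetical sketch of the general idea of hash-based random parameter sharing in a linear layer; it is NOT the authors' SS1 algorithm or its tiled Triton kernel, and the class name, compression factor, and hashing scheme are assumptions made purely for illustration.

    # Hypothetical hash-based parameter-sharing linear layer (illustration only,
    # not the paper's SS1(Z, X) algorithm).
    import torch
    import torch.nn as nn


    class HashedSharedLinear(nn.Module):
        def __init__(self, in_features, out_features, compression=4, seed=0):
            super().__init__()
            # Compact parameter bank, roughly 1/compression the size of a dense weight.
            bank_size = max(1, (in_features * out_features) // compression)
            self.bank = nn.Parameter(torch.randn(bank_size) * 0.02)
            # Fixed (non-trainable) hash map from each virtual weight to a bank slot.
            gen = torch.Generator().manual_seed(seed)
            idx = torch.randint(0, bank_size, (out_features, in_features), generator=gen)
            self.register_buffer("idx", idx)

        def forward(self, x):
            # Materialize the virtual weight matrix by gathering from the shared bank.
            w = self.bank[self.idx]  # shape: (out_features, in_features)
            return x @ w.t()


    layer = HashedSharedLinear(512, 512, compression=4)
    print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])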
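
The datasets cited in the open-datasets row are all publicly available. Below is a minimal sketch of how they could be fetched with the Hugging Face datasets library and torchvision; the config and split names are the standard public ones, but the loaders shown are not the authors' data pipeline.

    # Illustrative dataset loading only; the authors' preprocessing is in their repo.
    from datasets import load_dataset
    from torchvision import datasets as tv_datasets

    # WikiText-103 for language-model training, validation, and test.
    wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

    # One of the GLUE tasks used for BERT finetuning (QQP is the largest).
    qqp = load_dataset("glue", "qqp")

    # CIFAR-10 / CIFAR-100 for the MLPMixer experiments.
    cifar10 = tv_datasets.CIFAR10(root="./data", train=True, download=True)
    cifar100 = tv_datasets.CIFAR100(root="./data", train=True, download=True)

    # Tiny-ImageNet is distributed separately and is not bundled with torchvision.
    print(wikitext)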
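
Because the software-dependencies row notes that no version numbers are reported, a small environment-capture snippet such as the following could record the versions actually installed locally. The package names come from the row above; no specific versions are asserted.

    # Print the installed versions of the libraries the paper mentions.
    import importlib.metadata as md

    for pkg in ("torch", "transformers", "triton"):
        try:
            print(f"{pkg}=={md.version(pkg)}")
        except md.PackageNotFoundError:
            print(f"{pkg}: not installed")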
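
The experiment-setup row quotes the optimizer, schedule, and batching hyperparameters. The sketch below wires those quoted values into a PyTorch AdamW optimizer with a linear warmup schedule and gradient accumulation; the model, data, loss, and step count are placeholders rather than the authors' training script.

    # Illustrative training-loop skeleton using the quoted hyperparameters; the
    # model, data, loss, and step counts are placeholders.
    import torch
    from torch.optim import AdamW
    from transformers import get_linear_schedule_with_warmup

    model = torch.nn.Linear(512, 512)              # placeholder model
    optimizer = AdamW(model.parameters(),
                      lr=6e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1)

    total_steps = 1_000                            # placeholder; derived from 100 epochs in practice
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.01 * total_steps),  # linear warmup over 1% of steps
        num_training_steps=total_steps)

    accum_steps = 32                               # micro-batches per optimizer step
    micro_batch = 16                               # 16 * 32 = effective batch size 512

    for step in range(accum_steps * 4):            # a few optimizer steps, for illustration
        x = torch.randn(micro_batch, 512)          # placeholder batch
        loss = model(x).pow(2).mean()              # placeholder loss
        (loss / accum_steps).backward()            # average gradients over the accumulation window
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()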