SS1: Accelerating Inference with Fast and Expressive Sketch Structured Transform
Authors: Aditya Desai, Kimia Saedi, Apoorv Walia, Jihyeong Lee, Keren Zhou, Anshumali Shrivastava
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm empirically that SS1 offers better quality-efficiency tradeoffs than competing variants. We evaluate SS1 layers on a broad set of settings and use cases. A summary of our findings is below. |
| Researcher Affiliation | Collaboration | Kimia Saedi (1), Aditya Desai (1), Apoorv Walia (1), Jihyeong Lee (2), Keren Zhou (2), Anshumali Shrivastava (1,3); (1) Rice University, (2) George Mason University, (3) Ken Kennedy Institute, Third AI Corp., Xmad.ai |
| Pseudocode | Yes | Algorithm 1 SS1(Z, X) |
| Open Source Code | Yes | Our code is open-source.³ |
| Open Datasets | Yes | We use WikiText-103 [29] to train, evaluate, and test the model. MLPMixer is trained on the CIFAR (C10, C100) [30] and Tiny-ImageNet [31] datasets. We then finetune these models on the GLUE (General Language Understanding Evaluation) benchmark [18]. |
| Dataset Splits | No | The paper mentions evaluating on the 'test dataset' but does not explicitly provide details about training/validation/test splits, such as percentages or sample counts for each split. (A hedged example of loading WikiText-103's standard splits appears below the table.) |
| Hardware Specification | Yes | MLPMixer experiments: each run takes 2 hours on a single Quadro RTX 8000 machine. GPT2-Small experiments: each run takes around 13 hours on four 32 GB V100 GPUs. BERT experiments: QQP, the largest dataset for finetuning with BERT-Large, takes around 7-8 hours on a Quadro RTX 8000. Llama experiments: evaluation on MMLU takes 1 hour on one 40 GB A100. |
| Software Dependencies | No | The paper mentions using 'PyTorch', 'Hugging Face Transformers library', and 'Triton' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For training we use the AdamW optimizer with α=6e-4, β1=0.9, β2=0.999, ϵ=1e-08, and weight_decay=0.1; we employ a linear schedule and warm up 1% of the steps. The effective batch size is 512, which does not fit in our hardware memory; thus, we perform gradient accumulation every 32 steps to reach it. All models are trained for 100 epochs. The hyperparameters are adopted from the Monarch paper [16]. We use a fixed block size of Block_Size_K = 32, Block_Size_N = 32, and Block_Size_M = 64 for both the forward and backward kernels across all layers during training and testing. (A hedged PyTorch sketch of this optimizer and schedule setup follows the table.) |
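
The optimizer and schedule reported in the Experiment Setup row map onto standard PyTorch/Hugging Face primitives. The sketch below is a minimal illustration of those settings, not the authors' training script: the toy model, synthetic data, and total step count are placeholders, and using `get_linear_schedule_with_warmup` is an assumption about what "linear schedule with 1% warmup" means here.

```python
# Minimal sketch of the reported training configuration (AdamW, linear schedule
# with 1% warmup, effective batch 512 via 32-step gradient accumulation).
# The toy model, synthetic data, and step counts are placeholders, not the
# paper's GPT2-Small / MLPMixer setup.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(128, 128)            # placeholder model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1,
)

total_steps = 1_000                          # placeholder; the paper trains for 100 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * total_steps),  # warm up 1% of steps
    num_training_steps=total_steps,
)

accum_steps = 32                             # gradient accumulation every 32 steps
micro_batch = 512 // accum_steps             # = 16, implied by the 512 effective batch

optimizer.zero_grad()
for it in range(total_steps * accum_steps):
    x = torch.randn(micro_batch, 128)        # synthetic micro-batch
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
    if (it + 1) % accum_steps == 0:          # one optimizer step per effective batch
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```

Dividing the loss by the number of accumulation steps keeps the accumulated gradient equivalent to a single pass over the 512-sample effective batch.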
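
Although the paper does not report split sizes, WikiText-103 ships with canonical train/validation/test splits. The snippet below is a hedged example of inspecting those standard splits with the Hugging Face `datasets` library; the paper does not state that it uses this loader or the `wikitext-103-raw-v1` configuration.

```python
# Hedged example: load WikiText-103's standard splits and report their sizes.
# The "wikitext-103-raw-v1" configuration name is an assumption about which
# variant the authors used; the paper only cites WikiText-103 [29].
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
for split in ("train", "validation", "test"):
    print(f"{split}: {len(wikitext[split])} lines")
```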