Bifurcated Attention for Single-Context Large-Batch Sampling

Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 (Experiments): "We first conduct experiments to see how capabilities scale with respect to model size for each attention type in Section 5.1."
Researcher Affiliation | Industry | Together.ai (work conducted at AWS); AWS NGDE Science; GE HealthCare (work conducted at AWS); Amazon AGI (work conducted at AWS); Goldman Sachs (work conducted at AWS).
Pseudocode | Yes | Appendix E.3, "Implementation of Bifurcated Attention" (a hedged sketch of the idea is given after this table).
Open Source Code | Yes | "Link to our code: https://github.com/bifurcated-attn-icml2024/gpt-fast-parallel-sampling"
Open Datasets | Yes | "We use the average scores from two code generation benchmarks, multilingual HumanEval and MBXP (Athiwaratkun et al., 2022)"; citations: (Chen et al., 2021) for HumanEval and (Austin et al., 2021) for MBXP.
Dataset Splits | Yes | "Finally, a random split of 0.1% of the data was reserved as a validation set."
Hardware Specification | Yes | "We use Nvidia A100 GPUs for inference hardware" and "The experiment results below utilize an Nvidia H100 GPU."
Software Dependencies | No | The paper mentions software such as PyTorch Lightning, DeepSpeed, Hugging Face Transformers, and gpt-fast (PyTorch), but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "We trained multiple models with varying sizes, ranging from 125 million parameters to 13 billion parameters, using code data with a context size of 2048 and adjusting the per-GPU batch size and total number of steps according to the model size." Table 2 (Training Hyperparameters) lists total training steps, batch size, and max learning rate; Table 3 (Model Specifications) provides the number of groups, d_head, and n_layer.
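The Pseudocode row above points to Appendix E.3, "Implementation of Bifurcated Attention". The snippet below is a minimal PyTorch sketch of the underlying idea, assuming the following: the function name bifurcated_attention, the tensor shapes, and the einsum-based formulation are illustrative choices made here, not the paper's released gpt-fast-based kernels. The sketch splits each decoding step's attention into one part over the shared single-context KV cache (stored once for the whole batch) and one part over each sample's own incrementally generated KV cache, then combines them with a joint softmax so the output matches ordinary attention over the concatenated cache.

```python
# Hedged sketch of bifurcated attention for single-context large-batch sampling.
# Shapes, names, and the einsum formulation are assumptions for illustration only.
import torch
import torch.nn.functional as F

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec, scale):
    """
    q:     (batch, heads, 1, d)       query for the current decoding step, one per sample
    k_ctx: (1, heads, m_ctx, d)       shared single-context KV cache, stored once (not per sample)
    v_ctx: (1, heads, m_ctx, d)
    k_dec: (batch, heads, m_dec, d)   per-sample KV cache of tokens generated so far
    v_dec: (batch, heads, m_dec, d)
    scale: typically d ** -0.5
    """
    # Part 1: logits against the shared context; the context KV is loaded once for the batch.
    logits_ctx = torch.einsum("bhqd,ohkd->bhqk", q, k_ctx) * scale   # (batch, heads, 1, m_ctx)
    # Part 2: logits against each sample's own decoded tokens.
    logits_dec = torch.einsum("bhqd,bhkd->bhqk", q, k_dec) * scale   # (batch, heads, 1, m_dec)
    # Joint softmax over [context ; decoded] keys, then split the weights back into the two parts.
    weights = F.softmax(torch.cat([logits_ctx, logits_dec], dim=-1), dim=-1)
    w_ctx, w_dec = weights.split([k_ctx.shape[2], k_dec.shape[2]], dim=-1)
    # Weighted sums of values from both parts reproduce standard attention over the
    # concatenated KV cache, without replicating the context KV for every sample.
    out = (torch.einsum("bhqk,ohkd->bhqd", w_ctx, v_ctx)
           + torch.einsum("bhqk,bhkd->bhqd", w_dec, v_dec))
    return out
```

In this formulation the shared context keys and values are read from memory once per step rather than once per sample, which is the source of the memory-bandwidth saving the paper targets for large-batch sampling from a single context.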