Bifurcated Attention for Single-Context Large-Batch Sampling
Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments: We first conduct experiments to see how capabilities scale with respect to model size for each attention type in Section 5.1. |
| Researcher Affiliation | Industry | 1Together.ai (work conducted at AWS) 2AWS NGDE Science 3GE Health Care (work conducted at AWS) 4Amazon AGI (work conducted at AWS) 5Goldman Sachs (work conducted at AWS). |
| Pseudocode | Yes | E.3. Implementation of Bifurcated Attention |
| Open Source Code | Yes | Link to our code: https://github.com/bifurcated-attn-icml2024/gpt-fast-parallel-sampling |
| Open Datasets | Yes | We use the average scores from two code generation benchmarks, multilingual HumanEval and MBXP (Athiwaratkun et al., 2022), with citations (Chen et al., 2021) for HumanEval and (Austin et al., 2021) for MBXP. |
| Dataset Splits | Yes | Finally, a random split of 0.1% of the data was reserved as a validation set. |
| Hardware Specification | Yes | We use Nvidia A100 GPUs for inference hardware; the experiment results below utilize an Nvidia H100 GPU. |
| Software Dependencies | No | The paper mentions software such as PyTorch Lightning, DeepSpeed, Hugging Face Transformers, and GPTFast (PyTorch) but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We trained multiple models with varying sizes, ranging from 125 million parameters to 13 billion parameters, using code data with a context size of 2048 and adjusting the per-GPU batch size and total number of steps according to the model size. Table 2 (Training Hyperparameters) lists Total Training Steps, Batch Size, and Max Learning Rate, and Table 3 (Model Specifications) provides groups, d_head, and n_layer. |
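
The Pseudocode and Open Source Code rows above point to Appendix E.3 and the linked GitHub repository for the authors' actual implementation. For orientation only, the PyTorch sketch below illustrates the core idea of bifurcated attention under assumed tensor shapes and names (`q`, `k_ctx`, `k_dec`, etc. are illustrative, not taken from the paper's code): during incremental decoding, attention over the KV cache is split into one matmul over the single shared-context KV cache and one matmul over each sample's own decoded KV cache, joined by a single softmax, so the shared prefix is stored and read only once for the whole batch.

```python
# Minimal sketch of bifurcated attention for single-context, large-batch decoding.
# Shapes and argument names are assumptions for illustration, not the paper's API.
import torch

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec, scale):
    """
    q:      (batch, heads, 1, d)       query for the current decode step
    k_ctx:  (1, heads, m_ctx, d)       keys of the shared prompt, stored once
    v_ctx:  (1, heads, m_ctx, d)       values of the shared prompt, stored once
    k_dec:  (batch, heads, m_dec, d)   per-sample keys of previously decoded tokens
    v_dec:  (batch, heads, m_dec, d)   per-sample values of previously decoded tokens
    """
    # Scores against the shared context; the size-1 batch dim broadcasts over all samples.
    s_ctx = torch.matmul(q, k_ctx.transpose(-2, -1)) * scale   # (batch, heads, 1, m_ctx)
    # Scores against each sample's own decoded tokens.
    s_dec = torch.matmul(q, k_dec.transpose(-2, -1)) * scale   # (batch, heads, 1, m_dec)
    # Joint softmax over both segments, then split the weights back.
    w = torch.softmax(torch.cat([s_ctx, s_dec], dim=-1), dim=-1)
    w_ctx, w_dec = w.split([k_ctx.size(-2), k_dec.size(-2)], dim=-1)
    # Weighted sums; the shared-context values are again read only once.
    return torch.matmul(w_ctx, v_ctx) + torch.matmul(w_dec, v_dec)  # (batch, heads, 1, d)
```

In this sketch the memory saving comes purely from keeping `k_ctx`/`v_ctx` unreplicated across the batch; the paper's appendix and repository should be treated as the authoritative formulation.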