Bifurcated Attention for Single-Context Large-Batch Sampling

Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 (Experiments): "We first conduct experiments to see how capabilities scale with respect to model size for each attention type in Section 5.1."
Researcher Affiliation | Industry | Together.ai (work conducted at AWS); AWS NGDE Science; GE HealthCare (work conducted at AWS); Amazon AGI (work conducted at AWS); Goldman Sachs (work conducted at AWS).
Pseudocode | Yes | Appendix E.3, "Implementation of Bifurcated Attention" (a hedged sketch of the idea is given after this table).
Open Source Code | Yes | "Link to our code: https://github.com/bifurcated-attn-icml2024/gpt-fast-parallel-sampling"
Open Datasets | Yes | "We use the average scores from two code generation benchmarks, multilingual HumanEval and MBXP (Athiwaratkun et al., 2022)"; citations: (Chen et al., 2021) for HumanEval and (Austin et al., 2021) for MBXP.
Dataset Splits | Yes | "Finally, a random split of 0.1% of the data was reserved as a validation set."
Hardware Specification | Yes | "We use Nvidia A100 GPUs for inference hardware" and "The experiment results below utilize an Nvidia H100 GPU."
Software Dependencies | No | The paper mentions software such as PyTorch Lightning, DeepSpeed, Hugging Face Transformers, and gpt-fast (PyTorch), but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "We trained multiple models with varying sizes, ranging from 125 million parameters to 13 billion parameters, using code data with a context size of 2048 and adjusting the per-GPU batch size and total number of steps according to the model size." Table 2 (Training Hyperparameters) lists total training steps, batch size, and max learning rate; Table 3 (Model Specifications) provides the number of groups, d_head, and n_layer.
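The Pseudocode row above points to Appendix E.3, "Implementation of Bifurcated Attention". The snippet below is a minimal PyTorch sketch of the underlying idea, assuming the following: the function name bifurcated_attention, the tensor shapes, and the einsum-based formulation are illustrative choices made here, not the paper's released gpt-fast-based kernels. The sketch splits each decoding step's attention into one part over the shared single-context KV cache (stored once for the whole batch) and one part over each sample's own incrementally generated KV cache, then combines them with a joint softmax so the output matches ordinary attention over the concatenated cache.

```python
# Hedged sketch of bifurcated attention for single-context large-batch sampling.
# Shapes, names, and the einsum formulation are assumptions for illustration only.
import torch
import torch.nn.functional as F

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec, scale):
    """
    q:     (batch, heads, 1, d)       query for the current decoding step, one per sample
    k_ctx: (1, heads, m_ctx, d)       shared single-context KV cache, stored once (not per sample)
    v_ctx: (1, heads, m_ctx, d)
    k_dec: (batch, heads, m_dec, d)   per-sample KV cache of tokens generated so far
    v_dec: (batch, heads, m_dec, d)
    scale: typically d ** -0.5
    """
    # Part 1: logits against the shared context; the context KV is loaded once for the batch.
    logits_ctx = torch.einsum("bhqd,ohkd->bhqk", q, k_ctx) * scale   # (batch, heads, 1, m_ctx)
    # Part 2: logits against each sample's own decoded tokens.
    logits_dec = torch.einsum("bhqd,bhkd->bhqk", q, k_dec) * scale   # (batch, heads, 1, m_dec)
    # Joint softmax over [context ; decoded] keys, then split the weights back into the two parts.
    weights = F.softmax(torch.cat([logits_ctx, logits_dec], dim=-1), dim=-1)
    w_ctx, w_dec = weights.split([k_ctx.shape[2], k_dec.shape[2]], dim=-1)
    # Weighted sums of values from both parts reproduce standard attention over the
    # concatenated KV cache, without replicating the context KV for every sample.
    out = (torch.einsum("bhqk,ohkd->bhqd", w_ctx, v_ctx)
           + torch.einsum("bhqk,bhkd->bhqd", w_dec, v_dec))
    return out
```

In this formulation the shared context keys and values are read from memory once per step rather than once per sample, which is the source of the memory-bandwidth saving the paper targets for large-batch sampling from a single context.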