Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bifurcated Attention for Single-Context Large-Batch Sampling
Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments We first conduct experiments to see how capabilities scale with respect to model size for each attention type in Section 5.1. |
| Researcher Affiliation | Industry | ¹Together.ai (work conducted at AWS), ²AWS NGDE Science, ³GE HealthCare (work conducted at AWS), ⁴Amazon AGI (work conducted at AWS), ⁵Goldman Sachs (work conducted at AWS). |
| Pseudocode | Yes | E.3. Implementation of Bifurcated Attention |
| Open Source Code | Yes | Link to our code: https://github.com/bifurcated-attn-icml2024/gpt-fast-parallel-sampling |
| Open Datasets | Yes | We use the average scores from two code generation benchmarks, multilingual HumanEval and MBXP (Athiwaratkun et al., 2022). The paper also cites (Chen et al., 2021) for HumanEval and (Austin et al., 2021) for MBXP. |
| Dataset Splits | Yes | Finally, a random split of 0.1% of the data was reserved as a validation set. |
| Hardware Specification | Yes | "We use Nvidia A100 GPUs for inference hardware" and "The experiment results below utilize an Nvidia H100 GPU." |
| Software Dependencies | No | The paper mentions software such as PyTorch Lightning, DeepSpeed, Huggingface transformers, and GPTFast (PyTorch) but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | "We trained multiple models with varying sizes, ranging from 125 million parameters to 13 billion parameters, using code data with a context size of 2048 and adjusting the per-GPU batch size and total number of steps according to the model size." Table 2 (Training Hyperparameters) lists Total Training Steps, Batch Size, and Max Learning Rate. Table 3 (Model Specifications) provides groups, d_head, and n_layer. |
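
The Dataset Splits row quotes a random 0.1% validation split. As an illustration only, the sketch below shows one way such a split could be produced. The function name `split_train_validation`, the fixed seed, and the list-based data layout are assumptions for this example and are not taken from the paper or its code release.

```python
import random


def split_train_validation(examples, validation_fraction=0.001, seed=0):
    """Randomly reserve a fraction of examples as a validation set.

    Hypothetical helper: the paper only states that a random 0.1% of the
    data was held out, not how the split was implemented.
    """
    if not examples:
        return [], []

    # Shuffle indices with a seeded generator so the split is repeatable.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)

    # Hold out at least one example, even for very small datasets.
    n_validation = max(1, round(len(examples) * validation_fraction))
    validation_indices = set(indices[:n_validation])

    # Preserve the original ordering within each split.
    train = [ex for i, ex in enumerate(examples) if i not in validation_indices]
    validation = [ex for i, ex in enumerate(examples) if i in validation_indices]
    return train, validation


if __name__ == "__main__":
    train, validation = split_train_validation(list(range(10_000)))
    print(len(train), len(validation))
```

Fixing the seed keeps the held-out set identical across runs, which matters when validation loss is compared between models of different sizes.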