Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
First Attentions Last: Better Exploiting First Attentions for Efficient Parallel Training
Authors: Gyudong Kim, Hyukju Na, Jin Kyu Kim, Hyunsung Jang, Jaemin Park, Jaegi Hwang, NAMKOO HA, Seungryong Kim, Young Geun Kim
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18×, and achieves better perplexity compared to the baseline GPT. |
| Researcher Affiliation | Collaboration | (1) Korea University; (2) LIG Nex1 Co., Ltd.; (3) KAIST AI |
| Pseudocode | No | The paper describes the proposed methods and architectures (FAL and FAL+) using textual explanations and mathematical formulations (Equations 1, 2, 3, 4, 5, 6), as well as block diagrams (Figure 1), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at: https://casl-ku.github.io/FAL/ |
| Open Datasets | Yes | Datasets: We pre-train the models on OpenWebText corpus [38], a publicly available counterpart to GPT-2's WebText. For scalability analysis, we use the Pile dataset [39]. Zero-shot performance is evaluated on language understanding tasks using the SuperGLUE benchmark suite [8]. |
| Dataset Splits | No | The paper mentions using 'OpenWebText validation perplexity' and evaluating 'zero-shot results on the SuperGLUE benchmark', which implies the use of validation sets and predefined splits from these standard datasets. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for how these splits were performed or custom-generated for its own experiments. For instance, for OpenWebText, it doesn't state what proportion was used for training vs. validation. |
| Hardware Specification | Yes | Hardware: In order to comprehensively evaluate our approach across diverse GPU architectures and scales, we conduct experiments on multi-GPU configurations (2–8 GPUs) with RTX 3090 and H200 devices connected via PCIe or NVLink, and on single-GPU setups with RTX 3090, RTX 4090, and RTX A6000. |
| Software Dependencies | Yes | We performed the experiments using PyTorch and Colossal-AI on our server and a public cloud service. Common Settings: Version of PyTorch: 2.2.2; Version of CUDA: 12.3; Version of Colossal-AI: 0.4.0 |
| Experiment Setup | Yes | Table 1. We train each architecture on OpenWebText [38], an open-source replication of the WebText dataset originally used to train GPT-2. The dataset comprises approximately 41.7 GB of text, corresponding to 4 billion tokens. Given our limited computational resources, we use a compute-efficient batch size of 32, which has been shown to be sufficient for stable hyperparameter transfer in µP-based training [62, 63]. To evaluate language understanding performance, we report zero-shot results on the SuperGLUE benchmark [8], which includes BoolQ [42], CB [43], COPA [44], MultiRC [45], ReCoRD [46], RTE [47], WiC [48], and WSC [49]. No finetuning or additional training was performed on any task. CB and ReCoRD are evaluated using F1 score, while the remaining tasks use accuracy. System: 1; Epochs: 1; GPU#: 4; Model: GPT-2 774M, 1.5B; Parallel setting: 2TP/2DP; Total batch size: 32 (used gradient accumulation); Sequence length: 1024; Learning rate: 0.0001; Weight decay: 0.001; clip-grad-norm: 1; embd-pdrop: 0.1 |
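
For readers who want to sanity-check the scale implied by the Experiment Setup row, the reported settings can be collected into a small configuration object and turned into a rough step count. This is a minimal sketch, not the authors' configuration format: the class and field names are hypothetical, "2TP/2DP" is read here as tensor-parallel degree 2 and data-parallel degree 2, every sequence is assumed to fill the full 1024 tokens, and the corpus size is the approximate 4 billion tokens quoted in the row. The micro-batch size used for gradient accumulation is not reported, so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values copied from the Experiment Setup row; field names are hypothetical."""
    num_gpus: int = 4
    tensor_parallel: int = 2      # assumed reading of "2TP"
    data_parallel: int = 2        # assumed reading of "2DP"
    total_batch_size: int = 32    # sequences per optimizer step
    sequence_length: int = 1024
    learning_rate: float = 1e-4
    weight_decay: float = 1e-3
    clip_grad_norm: float = 1.0
    embd_pdrop: float = 0.1
    epochs: int = 1


def parallel_layout_is_consistent(cfg: ReportedSetup) -> bool:
    # The TP and DP degrees should multiply to the number of GPUs.
    return cfg.tensor_parallel * cfg.data_parallel == cfg.num_gpus


def tokens_per_step(cfg: ReportedSetup) -> int:
    # Upper bound: assumes every sequence is exactly sequence_length tokens.
    return cfg.total_batch_size * cfg.sequence_length


def sequences_per_replica(cfg: ReportedSetup) -> int:
    # Sequences each data-parallel replica handles per optimizer step,
    # summed over however many accumulation micro-steps are used.
    return cfg.total_batch_size // cfg.data_parallel


def approx_steps_per_epoch(cfg: ReportedSetup, corpus_tokens: int) -> int:
    # Rough estimate only; the corpus token count is itself approximate.
    return corpus_tokens // tokens_per_step(cfg)


if __name__ == "__main__":
    cfg = ReportedSetup()
    print("layout consistent:", parallel_layout_is_consistent(cfg))
    print("tokens per step:", tokens_per_step(cfg))
    print("sequences per replica:", sequences_per_replica(cfg))
    print("approx. steps per epoch:", approx_steps_per_epoch(cfg, 4_000_000_000))
```

Under these assumptions one optimizer step covers 32 × 1024 = 32,768 tokens, so a single epoch over roughly 4 billion tokens corresponds to on the order of 122,000 optimizer steps.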