Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Tensor-Parallelism with Partially Synchronized Activations
Authors: Itay Lamprecht, Asaf Karnieli, Yair Hanani, Niv Giladi, Daniel Soudry
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train a 7B parameter CAAT-Net model and show that tensor-parallel communication can be reduced by up to 50% with no significant drop in pretraining accuracy across nearly all evaluated benchmarks. We also experiment with smaller 130M and 1.1B models to show the robustness and scalability of our method. (Abstract) Experiments were conducted with Intel Gaudi3 HPU accelerators. (Section 5) |
| Researcher Affiliation | Collaboration | Itay Lamprecht, Asaf Karnieli, Yair Hanani, Niv Giladi, Daniel Soudry. Intel, Israel; Department of Electrical and Computer Engineering, Technion, Haifa, Israel; AWS AI Labs. EMAIL {ngiladi}@amazon.com EMAIL |
| Pseudocode | No | The paper describes methods through mathematical equations and figures (e.g., Figures 1 and 2) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/itlamp/Megatron-LM-comms |
| Open Datasets | Yes | Training was conducted from scratch over 160B tokens from the Red Pajama dataset, spanning 8 nodes, each containing 8 accelerators. Both models are trained on the Red Pajama dataset, using the GPTSentencePiece tokenizer. |
| Dataset Splits | No | The paper mentions training on '160B tokens from the Red Pajama dataset' and evaluates on 'a diverse set of common sense tasks selected from the Language Model Evaluation Harness framework'. While validation loss is reported, specific training, validation, and test splits (e.g., percentages or sample counts) for the main training datasets are not explicitly detailed. |
| Hardware Specification | Yes | Experiments were conducted with Intel Gaudi3 HPU accelerators. Gaudi3 has 128GB on-board memory. Each device has 525 GB/s intra-node connection and 75 GB/s inter-node connection. Our experiments were done using the gpt-fast repository, and consisted of replacing all-reduce with partial channel reduce during inference. We conducted experiments on 8 NVIDIA H100-80GB-HBM3, and 8 NVIDIA A100-SXM4-80GB, both with NVLink. |
| Software Dependencies | No | The paper mentions using 'Intel's Megatron-LM fork', 'Optimum-Habana', 'gpt-fast repository', 'GPTSentencePiece tokenizer', and 'AdamW optimizer'. However, specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | We trained a variation of Llama2-7b [8] with partial channel reduce. Training was conducted from scratch over 160B tokens from the Red Pajama dataset, spanning 8 nodes, each containing 8 accelerators. We chose a partial channel reduce hyperparameter of p = 0.5 and a tensor-parallel dimension of 8, with all other hyperparameters identical to those used in training of the original model. ... The 130M parameter model... has 16 attention heads and a hidden dimension size of 768. It was trained with an initial learning rate of 6 × 10⁻⁴, with the AdamW optimizer. The training was performed with a global batch size of 256 and a sequence length of 1024. The architecture consists of 12 transformer layers with multi-head attention. It is trained over 7.8B tokens. ... Similarly, we trained GPT3-XL [2] with tensor-parallel 8 and p = 0.5, on the Red Pajama dataset using the GPTSentencePiece tokenizer, rotary positional embeddings and with a global batch size of 512. We trained for a total of 50B tokens. |
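
The Experiment Setup figures quoted above imply some simple derived quantities that may help when sanity-checking a rerun. The sketch below is only back-of-the-envelope arithmetic, not part of the paper or its code: the helper names are hypothetical, "7.8B tokens" is treated as exactly 7,800,000,000, and the number of model replicas assumes the 7B run used no pipeline parallelism, which the quoted text does not state.

```python
def tokens_per_step(global_batch_size: int, seq_len: int) -> int:
    """Tokens consumed by one optimizer step (hypothetical helper)."""
    return global_batch_size * seq_len


def optimizer_steps(total_tokens: int, global_batch_size: int, seq_len: int) -> int:
    """Approximate number of optimizer steps, rounded up to a whole step."""
    per_step = tokens_per_step(global_batch_size, seq_len)
    return -(-total_tokens // per_step)  # ceiling division on integers


# 130M model: global batch size 256, sequence length 1024, 7.8B tokens.
# Assumption: "7.8B" is taken as exactly 7_800_000_000.
steps_130m = optimizer_steps(7_800_000_000, 256, 1024)

# 7B model: 8 nodes with 8 accelerators each, tensor-parallel dimension 8.
total_accelerators = 8 * 8
tensor_parallel = 8
# Assumption: no pipeline parallelism, so the remaining factor is the
# number of model replicas (data-parallel groups).
model_replicas = total_accelerators // tensor_parallel

print(tokens_per_step(256, 1024))  # 262144
print(steps_130m)                  # 29755
print(total_accelerators, model_replicas)  # 64 8
```

The sequence lengths for the 7B and GPT3-XL runs are not given in the quoted text, so the same step estimate is deliberately not attempted for them.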