Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Oscillation-Reduced MXFP4 Training for Vision Transformers
Authors: Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full-precision training. Table 2: Results on the 90-epoch pretraining of Vision Transformers. We report the Top-1 accuracy (%) on the validation dataset. |
| Researcher Affiliation | Academia | Yuxiang Chen¹ ², Haocheng Xi³, Jun Zhu¹, Jianfei Chen¹ ... ¹Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; ²Zhili College, Tsinghua University; ³University of California, Berkeley. Correspondence to: Jianfei Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 EMA Quantizer for a Micro-Block (Q-EMA) ... Algorithm 2 Adaptive Ramping Algorithm for MXFP4 Training (Q-Ramping) ... in Appendix C. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/TetraJet-MXFP4Training |
| Open Datasets | Yes | All the models are trained for 90 epochs on ImageNet-1K (Russakovsky et al., 2015) with default training recipes. |
| Dataset Splits | Yes | All the models are trained for 90 epochs on ImageNet-1K (Russakovsky et al., 2015) with default training recipes. |
| Hardware Specification | No | The paper mentions the 'next-generation Blackwell GPU architecture' as supporting MXFP4 format, but does not specify the hardware used for their own experiments. No specific GPU, CPU, or other hardware details for the experimental setup are provided. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in their experiments. |
| Experiment Setup | Yes | All the models are trained for 90 epochs on ImageNet-1K (Russakovsky et al., 2015) with default training recipes. For Q-EMA, the momentum β = 0.998 for calculating W_EMA is a good default choice. For Q-Ramping, k₁ = 16 is a good threshold for measuring the severity of oscillation, and k₂ = 5 is a default ratio for amplifying the learning rate & batch size (meanwhile reducing the frequency of oscillation). |
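The Q-EMA defaults quoted above (β = 0.998 for the weight EMA) can be illustrated with a minimal sketch. This is not the paper's Algorithm 1, which operates on MXFP4 micro-blocks with shared scales; it only shows the core idea on a hypothetical uniform quantization grid: round each weight toward the quantized value of its EMA, so weights sitting near a rounding boundary stop flipping between adjacent grid points across training steps. The function names and the grid `step` are illustrative assumptions, not identifiers from the paper's code.

```python
import numpy as np

def quantize_nearest(w, step=0.25):
    # Plain round-to-nearest on a uniform grid (illustrative, not MXFP4).
    return np.round(w / step) * step

def update_ema(w_ema, w, beta=0.998):
    # beta = 0.998 is the default reported in the paper.
    return beta * w_ema + (1.0 - beta) * w

def quantize_ema(w, w_ema, step=0.25):
    """Conceptual sketch of EMA-guided rounding (the Q-EMA idea):
    pick, of the two grid points bracketing each weight, the one
    closer to the quantization of the weight's EMA."""
    lo = np.floor(w / step) * step          # grid point below w
    hi = lo + step                          # grid point above w
    target = quantize_nearest(w_ema, step)  # where the slow EMA rounds to
    return np.where(np.abs(lo - target) <= np.abs(hi - target), lo, hi)
```

For example, a weight at 0.37 on a 0.25 grid rounds down to 0.25 when its EMA sits near 0.30, but up to 0.5 when its EMA has drifted toward 0.45; because the EMA moves slowly (β = 0.998), the chosen grid point stays stable even while the raw weight oscillates around the boundary.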