Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Oscillation-Reduced MXFP4 Training for Vision Transformers
Authors: Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full-precision training. Table 2: Results on the 90-epoch pretraining of Vision Transformers. We report the Top-1 accuracy (%) on the validation dataset. |
| Researcher Affiliation | Academia | Yuxiang Chen¹ ², Haocheng Xi³, Jun Zhu¹, Jianfei Chen¹ ... ¹Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; ²Zhili College, Tsinghua University; ³University of California, Berkeley. Correspondence to: Jianfei Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 EMA Quantizer for a Micro-Block (Q-EMA) ... Algorithm 2 Adaptive Ramping Algorithm for MXFP4 Training (Q-Ramping) ... in Appendix C. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/TetraJet-MXFP4Training |
| Open Datasets | Yes | All the models are trained for 90 epochs on ImageNet-1K (Russakovsky et al., 2015) with default training recipes. |
| Dataset Splits | Yes | All the models are trained for 90 epochs on ImageNet-1K (Russakovsky et al., 2015) with default training recipes. |
| Hardware Specification | No | The paper mentions the 'next-generation Blackwell GPU architecture' as supporting MXFP4 format, but does not specify the hardware used for their own experiments. No specific GPU, CPU, or other hardware details for the experimental setup are provided. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in their experiments. |
| Experiment Setup | Yes | All the models are trained for 90 epochs on ImageNet-1K (Russakovsky et al., 2015) with default training recipes. For Q-EMA, the momentum β = 0.998 for calculating W_EMA is a good default choice. For Q-Ramping, k₁ = 16 is a good threshold for measuring the severity of oscillation, and k₂ = 5 is a default ratio for amplifying the learning rate & batch size (meanwhile reducing the frequency of oscillation). |
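The Q-EMA defaults quoted above (β = 0.998 for the weight EMA) can be illustrated with a minimal sketch. This is not the paper's Algorithm 1, which operates on MXFP4 micro-blocks with shared scales; it only shows the core idea on a hypothetical uniform quantization grid: round each weight toward the quantized value of its EMA, so weights sitting near a rounding boundary stop flipping between adjacent grid points across training steps. The function names and the grid `step` are illustrative assumptions, not identifiers from the paper's code.

```python
import numpy as np

def quantize_nearest(w, step=0.25):
    # Plain round-to-nearest on a uniform grid (illustrative, not MXFP4).
    return np.round(w / step) * step

def update_ema(w_ema, w, beta=0.998):
    # beta = 0.998 is the default reported in the paper.
    return beta * w_ema + (1.0 - beta) * w

def quantize_ema(w, w_ema, step=0.25):
    """Conceptual sketch of EMA-guided rounding (the Q-EMA idea):
    pick, of the two grid points bracketing each weight, the one
    closer to the quantization of the weight's EMA."""
    lo = np.floor(w / step) * step          # grid point below w
    hi = lo + step                          # grid point above w
    target = quantize_nearest(w_ema, step)  # where the slow EMA rounds to
    return np.where(np.abs(lo - target) <= np.abs(hi - target), lo, hi)
```

For example, a weight at 0.37 on a 0.25 grid rounds down to 0.25 when its EMA sits near 0.30, but up to 0.5 when its EMA has drifted toward 0.45; because the EMA moves slowly (β = 0.998), the chosen grid point stays stable even while the raw weight oscillates around the boundary.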