Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Authors: Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, shao xinyu, Gao Huang, Xiu Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. |
| Researcher Affiliation | Academia | 1 Tsinghua University 2 Peking University |
| Pseudocode | No | The paper describes the Visual Contrast Attention mechanism through mathematical equations and textual descriptions in Section 3.2 "Our Approach" but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/LeapLabTHU/LinearDiff. |
| Open Datasets | Yes | The ImageNet-1K [7] recognition dataset contains 1.28M training images and 50K validation images with a total of 1,000 classes. For image recognition experiments, images are trained and evaluated in 224 × 224 size. The top-1 accuracy on the validation set is adopted as the evaluation metric. |
| Dataset Splits | Yes | The ImageNet-1K [7] recognition dataset contains 1.28M training images and 50K validation images with a total of 1,000 classes. For image recognition experiments, images are trained and evaluated in 224 × 224 size. The top-1 accuracy on the validation set is adopted as the evaluation metric. For image generation tasks, we train and evaluate the images in 256 × 256 size, following the commonly used practice in class-condition generation. We use FID-50K as the evaluation metric, which measures the Fréchet distance between the Inception-V3 features of 50 000 generated images and 50 000 real validation images. |
| Hardware Specification | No | The paper does not explicitly specify the hardware used for running the experiments. It mentions the computational cost and implies significant computation but no details on GPU/CPU models or specific hardware setups are provided. |
| Software Dependencies | No | The paper mentions optimizers and data augmentation techniques like AdamW [48], RandAugment [6], Mixup [93], CutMix [92], random erasing [100], and EMA [57]. However, it does not provide specific version numbers for these software components or the main deep learning framework used (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | For image recognition experiments, we use the same training setup as the baseline models to ensure fair comparison. All models are trained from scratch using the AdamW [48] optimizer for 300 epochs. We apply cosine learning rate decay, starting with 20 epochs of linear warm-up, and set the initial learning rate to 1 × 10⁻³ with a weight decay of 0.05. The data augmentation and regularization methods include RandAugment [6], Mixup [93], CutMix [92], and random erasing [100]. We also follow CSwin [10] and use EMA [57] during training. For image generation tasks, we follow DiT [56] and SiT [52] to train class-conditional diffusion transformer models on the ImageNet-1K [8] dataset. All models are trained with the AdamW [40, 49] optimizer and no weight decay. For 256 × 256 resolution, we train from scratch with a global batch size of 256 for 400,000 iterations. The learning rate is kept constant at 1 × 10⁻⁴. We use only random horizontal flip for data augmentation during training. Additionally, we apply exponential moving average (EMA) to the model weights with a decay rate of 0.9999. |
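
The recognition schedule quoted in the Experiment Setup row (20 epochs of linear warm-up, then cosine decay over 300 epochs from a peak of 1 × 10⁻³) and the EMA update with decay 0.9999 can be illustrated with a minimal, framework-free sketch. This is not the authors' code: the function names `lr_at_epoch` and `ema_update`, the per-epoch stepping, the warm-up starting value, and the final learning rate of 0 are all assumptions made for illustration, since the quoted text does not specify them.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Learning rate for a given 0-indexed epoch.

    Linear warm-up for `warmup_epochs`, then cosine decay to `min_lr`.
    The warm-up start value and `min_lr` are assumptions, not from the paper.
    """
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the cosine phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step over flat lists of floats: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


if __name__ == "__main__":
    for epoch in (0, 19, 20, 160, 299):
        print(epoch, lr_at_epoch(epoch))
    print(ema_update([1.0, 2.0], [0.0, 0.0]))
```

For the generation experiments the quoted setup keeps the learning rate constant at 1 × 10⁻⁴, so only the EMA step of this sketch would apply there.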