Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

REOrdering Patches Improves Vision Models

Authors: Declan Kutscher, David Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To study whether such datasets are susceptible to patch ordering effects to different degrees, we run experiments on two datasets: ImageNet-1K [20] (natural images) and Functional Map of the World [21] (satellite).
Researcher Affiliation | Academia | (1) University of Pittsburgh; (2) University of California, Berkeley
Pseudocode | Yes | Algorithm 1 REOrder with a Plackett-Luce policy
Open Source Code | Yes | Code and animations are available on the project page. (...) The anonymized repository for this work is made available here.
Open Datasets | Yes | We run experiments on two datasets: ImageNet-1K [20] (natural images) and Functional Map of the World [21] (satellite). (...) ImageNet-1K is obtained from the official download portal and Functional Map of the World is obtained from their official AWS S3 bucket.
Dataset Splits | Yes | We train on their respective training sets and report results on the validation sets. (...) We evaluated top-1 accuracy on validation sets, estimating the Standard Error of the Mean (SEM) using a non-parametric bootstrap method with 2,000 resamples.
Hardware Specification | Yes | Experiments are conducted on machines equipped with either 8 × 80GB A100 GPUs or 4 × 40GB A100 GPUs.
Software Dependencies | No | We utilized the timm implementation for the Vision Transformer (ViT) and the Hugging Face implementation for Longformer. Both were adapted with minor modifications to accommodate varying patch permutations. TXL is based on the official implementation and includes a newly introduced, learned absolute position embedding to account for changing patch orders across batches. We use ARM [9] as our vision Mamba model of choice due to its training stability. For all of the models, the image size is 224 × 224 with a patch size of 16 × 16. The Transformer-XL memory length (M) was set to 128 and the attention window size (M_local) for Longformer was set to 14. All four models prepend a learnable class [CLS] token as a fixed-length representation for image classification. The [CLS] token is always retained as the first token in the sequence. All models use their respective Base configurations. Complete details about the model configurations are in Appendix C.
Experiment Setup | Yes | All models are trained for 100 epochs with the AdamW optimizer using β1 = 0.9, β2 = 0.999, weight decay of 0.03, and a base learning rate of α = 1.0 × 10^-4. Batch sizes are held constant for all runs across all model-dataset pairs (details in Appendix D). We apply cosine learning rate decay with a linear warmup over 5 epochs. For the reinforcement learning experiments introduced in Section 6, we use the same optimizer configuration but with a reduced base learning rate of α = 1.0 × 10^-5 and no decay.
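
The Pseudocode row refers to a Plackett-Luce policy over patch orderings. The paper's Algorithm 1 is not reproduced on this page, so the following is only a minimal sketch of how a permutation can be drawn from a generic Plackett-Luce distribution; it is not the authors' implementation. The function names `sample_plackett_luce` and `with_cls_first`, the per-patch `logits` parameterization, and the index convention for the [CLS] token are all assumptions made for illustration. The patch count follows from the quoted 224 × 224 image size and 16 × 16 patch size.

```python
import math
import random


def sample_plackett_luce(logits, rng):
    """Draw one permutation from a Plackett-Luce distribution.

    At each step, one of the not-yet-placed items is chosen with
    probability proportional to exp(logit). Returns the permutation
    and its log-probability under the distribution.
    """
    remaining = list(range(len(logits)))
    order = []
    log_prob = 0.0
    while remaining:
        # Subtract the max logit before exponentiating for numerical stability.
        top = max(logits[i] for i in remaining)
        weights = [math.exp(logits[i] - top) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        log_prob += math.log(weights[pick] / total)
        order.append(remaining.pop(pick))
    return order, log_prob


def with_cls_first(patch_order):
    """Hypothetical helper: keep token 0 ([CLS]) first, shift patch indices by 1."""
    return [0] + [p + 1 for p in patch_order]


# 224 x 224 image with 16 x 16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
rng = random.Random(0)
logits = [0.0] * num_patches  # assumed initialization: all orderings equally likely
patch_order, log_prob = sample_plackett_luce(logits, rng)
token_order = with_cls_first(patch_order)
```

With all logits equal, every ordering is equally likely, so the returned log-probability equals the negative log of 196 factorial. In a policy-gradient setting this log-probability is the quantity that would typically be differentiated, which is why the sketch returns it alongside the permutation.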
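
The Dataset Splits row mentions estimating the SEM of top-1 accuracy with a non-parametric bootstrap using 2,000 resamples. One common way to do this is sketched below, under the assumption that per-example correctness is available as a list of 0/1 values; the function name `bootstrap_sem` and the seeding are illustrative choices, not details taken from the paper.

```python
import random
import statistics


def bootstrap_sem(correct, n_resamples=2000, seed=0):
    """Estimate the standard error of mean accuracy by bootstrap resampling.

    `correct` holds one 0/1 entry per validation example. Each resample
    draws len(correct) entries with replacement and records the accuracy;
    the SEM estimate is the standard deviation of those accuracies.
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        resample = rng.choices(correct, k=n)
        accuracies.append(sum(resample) / n)
    return statistics.stdev(accuracies)


# Toy data (not from the paper): 1,000 examples at 70% top-1 accuracy.
toy_correct = [1] * 700 + [0] * 300
toy_sem = bootstrap_sem(toy_correct)
```

For binary outcomes the result should land close to the analytic value sqrt(p(1 - p) / n), about 0.0145 for the toy data above, which gives a quick sanity check on any bootstrap implementation.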
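
The Experiment Setup row describes cosine learning rate decay with a linear warmup over 5 epochs, 100 training epochs, and a base learning rate of 1.0 × 10^-4. A minimal sketch of such a schedule follows. Several details are assumptions because the quoted text does not state them: the warmup starts from zero, the cosine decays to zero, and the rate is computed as a function of a (possibly fractional) epoch rather than per optimizer step. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=100):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0.

    `epoch` may be fractional, which allows per-step updates.
    """
    if epoch < warmup_epochs:
        # Assumed: warmup begins at zero and rises linearly.
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    # Assumed: the cosine schedule anneals all the way to zero.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# One value per epoch for the 100-epoch supervised setting.
schedule = [learning_rate(e) for e in range(100)]
```

Under this reading, the rate peaks at the base value at epoch 5 and falls to half of it midway through the decay phase. The reinforcement learning runs are described as using a base rate of 1.0 × 10^-5 with no decay, which would correspond to holding the rate constant rather than applying the cosine term.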