Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

JAFAR: Jack up Any Feature at Any Resolution

Authors: Paul Couairon, Loïck Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page: https://jafar-upsampler.github.io 4 Experiments 4.1 Experimental Setup 4.2 Qualitative Comparisons 4.3 Transfer on Downstream Tasks 4.3.1 Semantic Segmentation 4.3.2 Depth Estimation 4.3.3 Class Activation Maps Faithfulness 4.3.4 Zero-Shot Open-Vocabulary Segmentation 4.3.5 Bird's-Eye View Segmentation 4.4 Ablations
Researcher Affiliation | Collaboration | Paul Couairon (1, 2), Loïck Chambon (1, 3), Louis Serrano (1), Jean-Emmanuel Haugeard (2), Matthieu Cord (1, 3), Nicolas Thome (1, 4). (1) Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; (2) Thales, TSGF, cortAIx Labs, France; (3) Valeo.ai; (4) Institut Universitaire de France (IUF)
Pseudocode | No | The paper describes the architecture and method using textual descriptions, mathematical equations, and diagrams (Figure 2), but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | Project page: https://jafar-upsampler.github.io NeurIPS Paper Checklist 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The datasets used in the experiments are publicly available. Code will be released.
Open Datasets | Yes | 4.3.1 Semantic Segmentation For semantic segmentation, we train a linear projection head to predict coarse class labels using a cross-entropy loss across several benchmark datasets: COCO-Stuff [40] (27 classes), ADE20K [41] (150 classes), Pascal VOC [42] (21 classes including background), and Cityscapes [43] (27 classes). 4.3.5 Bird's-Eye View Segmentation Finally, we studied the impact of our upsampler in a complex training pipeline. The task, evaluated on nuScenes [48], takes several images taken from cameras as input and consists on outputting the bird's-eye view (BeV) segmentation map. A.1 Evaluation To evaluate Class Activation Maps (CAMs), we employ a frozen pre-trained ViT-B/16 model as the backbone and extract Grad-CAMs. We randomly sample 2,000 images from the ImageNet validation set for which the model produces correct predictions.
Dataset Splits | Yes | 4.3.1 Semantic Segmentation For semantic segmentation, we train a linear projection head to predict coarse class labels using a cross-entropy loss across several benchmark datasets: COCO-Stuff [40] (27 classes), ADE20K [41] (150 classes), Pascal VOC [42] (21 classes including background), and Cityscapes [43] (27 classes). The linear layer is trained for 5 epochs on COCO-Stuff and 20 epochs on the remaining datasets, using a batch size of 4. Performance is evaluated on the respective validation sets using mean Intersection-over-Union (mIoU) and pixel-wise accuracy. 4.3.2 Depth Estimation We train the linear probe for 5 epochs on the COCO training set, using a batch size of 4. A.1 Evaluation To evaluate Class Activation Maps (CAMs), we employ a frozen pre-trained ViT-B/16 model as the backbone and extract Grad-CAMs. We randomly sample 2,000 images from the ImageNet validation set for which the model produces correct predictions.
Hardware Specification | Yes | 4.1 Experimental Setup In our experiments, we train JAFAR on a single NVIDIA A100 on ImageNet training set for 100K steps using AdamW optimizer [39], with a learning rate of 2e-4 and a batch size of 4. E Performance We compare in Tabs. 12 and 13 the runtime and memory usage respectively of various methods with a batch size of 1 and input resolution of 448, across multiple target resolutions. The experiments are conducted on a single A100 GPU.
Software Dependencies | No | 4.1 Experimental Setup In our experiments, we train JAFAR on a single NVIDIA A100 on ImageNet training set for 100K steps using AdamW optimizer [39], with a learning rate of 2e-4 and a batch size of 4. B.2 Class Activation Maps We present additional Grad-CAM visualizations based on ViT-B/16 features from the ImageNet validation set in Fig. 6. B.5 Attention Maps Visualization To illustrate the behavior of the upsampling module, we visualize attention maps in Fig. 9. B.4 Semantic Segmentation Fig. 8 presents examples of linear probe transfer learning for semantic segmentation on the COCO-Stuff dataset. The paper mentions specific tools and techniques like the AdamW optimizer, Grad-CAM, and various backbone models (e.g., DINOv2 ViT-S/14, CLIP-ViT-B/16), and also refers to libraries like MMCV [52] and PyTorch Image Models [53], but does not provide specific version numbers for these software components.
Experiment Setup | Yes | 4.1 Experimental Setup In our experiments, we train JAFAR on a single NVIDIA A100 on ImageNet training set for 100K steps using AdamW optimizer [39], with a learning rate of 2e-4 and a batch size of 4. The input images fed into the foundation vision encoder are resized to 448 × 448, producing high-resolution target feature maps F_hr of size 32 × 32 or 28 × 28, depending on the encoder's patch size (14 or 16). For improved training efficiency, the guidance image input to JAFAR is downsampled to 224 × 224. 4.3.1 Semantic Segmentation The linear layer is trained for 5 epochs on COCO-Stuff and 20 epochs on the remaining datasets, using a batch size of 4. 4.3.2 Depth Estimation We train the linear probe for 5 epochs on the COCO training set, using a batch size of 4. 4.3.5 Bird's-Eye View Segmentation We adopted the optimization hyperparameters from PointBeV [46], adjusting the batch size to 1 and training for 100 epochs.
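
The feature-map sizes quoted in the Experiment Setup row follow from dividing the 448-pixel encoder input by the encoder's patch size. A minimal Python sketch of that arithmetic, alongside the reported JAFAR training hyperparameters, is given here. The class name, field names, and function name are hypothetical and are not taken from the paper or its code release; only the numeric values come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JafarTrainConfig:
    """Hyperparameters quoted from Section 4.1; all field names are hypothetical."""

    optimizer: str = "AdamW"
    learning_rate: float = 2e-4
    batch_size: int = 4
    train_steps: int = 100_000
    encoder_input_size: int = 448   # side length of the image fed to the vision encoder
    guidance_input_size: int = 224  # side length of the guidance image fed to JAFAR


def feature_grid_size(image_size: int, patch_size: int) -> int:
    """Side length of the token grid, assuming non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


cfg = JafarTrainConfig()
for patch_size in (14, 16):
    side = feature_grid_size(cfg.encoder_input_size, patch_size)
    print(f"patch size {patch_size}: target feature map {side} x {side}")
```

With the reported settings this gives a 32 × 32 grid for patch size 14 and a 28 × 28 grid for patch size 16, matching the sizes stated in the quoted setup.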