Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On the rankability of visual embeddings

Authors: Ankit Sonthalia, Arnas Uselis, Seong Joon Oh

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To address (1), we evaluate 7 modern visual encoders, from ResNet to CLIP, across 9 datasets with 7 attributes: age, crowd count, 3 head pose angles (pitch, roll, yaw), image aesthetics, and recency. We find that many embedding spaces are indeed rankable (Section 3).
Researcher Affiliation	Academia	Ankit Sonthalia University of Tübingen EMAIL Arnas Uselis University of Tübingen EMAIL Seong Joon Oh University of Tübingen
Pseudocode	No	The paper describes methods in prose, such as the definition of rankability and experimental procedures in Section 3 and 4, but does not include a distinct pseudocode or algorithm block.
Open Source Code	Yes	Our code is available at https://github.com/aktsonthalia/rankablevision-embeddings.
Open Datasets	Yes	In total, we use 9 datasets, covering 7 attributes. We provide a detailed breakdown in Table 1. ... UTKFace, introduced in [83], is a dataset of face images with age labels ranging from 0 to 116. ... The dataset was downloaded from the official website (https://susanqq.github.io/UTKFace/). ... UCF-QNRF, introduced in [7], is a large crowd counting dataset ... We use the official download link at https://www.crcv.ucf.edu/data/ucf-qnrf/ and the official train-test splits.
Dataset Splits	Yes	Table 1: Datasets and attributes. Summary of datasets used for evaluating the rankability of visual representations. ... Age Adience [11] 14k 4k 8 age groups Official 5-fold ... Crowd count UCF-QNRF [7] 1,201 334 49 12,865 Official ... Yaw BIWI Kinect 10,493 4,531 75 6 test seqs. ... We use the split provided by [78] in their official repository (https://github.com/uynaes/RankingAwareCLIP/tree/main/examples).
Hardware Specification	Yes	All experiments were conducted on a single NVIDIA A100 GPU with 40GB of memory.
Software Dependencies	No	The paper mentions using the stuned Python library [55] and models obtained from timm [72], but does not provide specific version numbers for these software components.
Experiment Setup	Yes	Linear and nonlinear regression. We test 30 random hyperparameter configurations per dataset-model pair. The initial learning rate is sampled from a log-uniform distribution over [10^-6, 10^-1] and decayed over a cosine schedule to zero, while the weight decay is sampled from a log-uniform distribution over [10^-7, 10^-4]. Data augmentation (horizontal flipping) is also toggled on or off randomly for a given run. We use 100 epochs and a batch size of 128 throughout. For nonlinear regression, we use a 2-layer MLP with 128 hidden dimensions and ReLU non-linearity.
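
The random search quoted in the Experiment Setup row can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' code. It assumes the sampling ranges are 10^-6 to 10^-1 for the learning rate and 10^-7 to 10^-4 for the weight decay (the exponent signs are missing from the extracted text), that horizontal flipping is toggled with probability 0.5, and that the cosine decay is evaluated per training step. The names `sample_log_uniform`, `sample_config`, and `cosine_lr` are hypothetical.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Sample one hyperparameter configuration for a dataset-model pair."""
    return {
        # Assumed range: 10^-6 to 10^-1 (exponent signs lost in extraction).
        "lr": sample_log_uniform(rng, 1e-6, 1e-1),
        # Assumed range: 10^-7 to 10^-4.
        "weight_decay": sample_log_uniform(rng, 1e-7, 1e-4),
        # Assumption: flipping is switched on with probability 0.5.
        "hflip": rng.random() < 0.5,
        # Fixed values stated in the quoted setup.
        "epochs": 100,
        "batch_size": 128,
    }


def cosine_lr(initial_lr, step, total_steps):
    """Cosine decay from initial_lr at step 0 down to zero at total_steps."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * step / total_steps))


# 30 random configurations, as in the quoted setup; the seed is arbitrary.
rng = random.Random(0)
configs = [sample_config(rng) for _ in range(30)]
```

Sampling in log space spreads the draws evenly across orders of magnitude, which is why it is a common choice for learning rates and weight decay; sampling uniformly on the raw interval would place almost all draws near the upper bound.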