Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

An Effective Levelling Paradigm for Unlabeled Scenarios

Authors: Fangming Cui, Zhou Yu, Di Yang, Yuqiang Ren, Liang Xiao, Xinmei Tian

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments have shown that our design can effectively address generalized scenarios and tasks. Representative tasks across 11 real datasets on generalization from base-to-novel, cross-dataset generalization, and domain generalization demonstrate that our design can effectively address generalized scenarios and tasks.
Researcher Affiliation	Collaboration	Fangming Cui (1,2), Zhou Yu (3), Di Yang (4), Yuqiang Ren (4), Liang Xiao (1), Xinmei Tian (5). Affiliations: (1) Defense Innovation Institute; (2) Shanghai Jiao Tong University; (3) The Key Laboratory of Complex Systems Modeling and Simulation, the School of Computer Science, Hangzhou Dianzi University, China; (4) ByteDance Inc.; (5) MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Pseudocode	No	The paper describes the proposed method, Levelling Paradigm (LePa), using text and mathematical equations (e.g., Equations 5 and 6), but does not include structured pseudocode or algorithm blocks.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We are organizing the code.
Open Datasets	Yes	The open-source real datasets cover multiple recognition tasks. We conducted base-to-novel generalization experiments and cross-dataset generalization experiments on 11 datasets. We conduct domain generalization experiments on four variants of ImageNet [35]. The datasets encompass various recognition tasks, including ImageNet [35], Caltech101 [36] for generic objects, Oxford Pets [37], Stanford Cars [38], Flowers102 [39], Food101 [40], FGVCAircraft [41] for fine-grained classification, SUN397 [42] for scene recognition, UCF101 [43] for action recognition, DTD [44] for texture classification, and EuroSAT [45] for satellite images. For the domain generalization benchmark, we use ImageNet-A [46], ImageNet-R [47], ImageNet-Sketch [48] and ImageNetV2 [49].
Dataset Splits	Yes	We follow a setting where the same datasets are split into base and novel classes. So the distribution of the base classes is similar to that of the novel classes. Dividing all classes of a dataset into two parts is a process of random division. Please note that this is divided into two equally sized parts. The model is trained only on the base classes in a 16-shot setting and tested on base (non-generalization task) and novel classes (generalization task).
Hardware Specification	Yes	We use deep prompting with multi-modal encoders and an SGD optimizer with a learning rate of 0.0026 on a single A5000 GPU.
Software Dependencies	No	The paper mentions using an SGD optimizer and a CLIP model, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	We set textual and visual embeddings to 4 based on all VLP methods. We use deep prompting with multi-modal encoders and an SGD optimizer with a learning rate of 0.0026 on a single A5000 GPU. Training for 30 epochs for base-to-novel generalization by 16-shot, 20 epochs for domain generalization and cross-dataset evaluation setting. We train the ImageNet source model on all classes with 16-shot in the first 3 transformer layers for domain generalization and cross-dataset evaluation. We set γ1 = γ2 = 5 for multi-modal regularization in total loss. We set w = 4 for w-worst cases. For the base-to-novel generalization, we set the learning depth to 9. We fix N = 60 hand-crafted prompts [1, 22], following CLIP.
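
The base-to-novel protocol quoted in the Dataset Splits row (a random division of a dataset's classes into two equally sized parts, with 16-shot training on the base half) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code, which the paper states is not yet released. The function names `split_base_novel` and `sample_k_shot`, the `seed` parameter, and the handling of an odd class count (the novel half receives the extra class) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_base_novel(class_names, seed=0):
    """Randomly divide classes into two halves: (base, novel).

    Assumption: with an odd number of classes, the novel half gets one more.
    """
    rng = random.Random(seed)
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def sample_k_shot(samples, classes, k=16, seed=0):
    """Pick up to k training items per class from (item, label) pairs."""
    rng = random.Random(seed)
    allowed = set(classes)
    by_class = defaultdict(list)
    for item, label in samples:
        if label in allowed:
            by_class[label].append(item)
    # Classes with fewer than k items keep everything they have.
    return {
        label: rng.sample(by_class[label], min(k, len(by_class[label])))
        for label in classes
    }


if __name__ == "__main__":
    # Hypothetical toy data: 10 classes with 20 items each.
    classes = [f"class_{i}" for i in range(10)]
    data = [(f"img_{c}_{j}", c) for c in classes for j in range(20)]

    base, novel = split_base_novel(classes, seed=0)
    train_set = sample_k_shot(data, base, k=16, seed=0)
    print(len(base), len(novel), len(train_set[base[0]]))
```

Under this sketch, training would use only `train_set` (base classes), while evaluation would run separately on held-out items from `base` and from `novel`, matching the non-generalization and generalization tasks described in the row.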