Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery

Authors: Jizhou Han, Shaokun Wang, Yuhang He, Chenhao Ding, Qiang Wang, Xinyuan Gao, SongLin Dong, Yihong Gong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our method achieves state-of-the-art performance on multiple GCD benchmarks, significantly enhancing novel category accuracy and demonstrating its effectiveness. Section 5 (Experiments): Our evaluation spans six image classification benchmarks, including both generic and fine-grained datasets. To evaluate the impact of Unsupervised ETF Alignment and Supervised ETF Alignment, we conduct ablation studies with four configurations: (1) Baseline (both disabled), (2) Supervised ETF only, (3) Unsupervised ETF only, and (4) Full model (both enabled). Table 3 presents the results, revealing key insights into each component's contribution.
Researcher Affiliation	Collaboration	Jizhou Han (1), Shaokun Wang (2), Yuhang He (1), Chenhao Ding (3), Qiang Wang (1), Xinyuan Gao (4), Songlin Dong (5), Yihong Gong (1). Affiliations: (1) National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; (2) School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; (3) School of Software Engineering, Xi'an Jiaotong University; (4) Kuaishou Technology; (5) Faculty of Microelectronics, Shenzhen University of Advanced Technology
Pseudocode	Yes	Algorithm 1 Semantic Consistency Matcher (SCM)
Open Source Code	No	The datasets used are publicly available, and the code will be released upon paper acceptance to ensure reproducibility.
Open Datasets	Yes	Our evaluation spans six image classification benchmarks, including both generic and fine-grained datasets. For generic datasets, we use CIFAR-100 [37] and ImageNet-100 [38]. For fine-grained datasets, we evaluate on CUB-200 [39], Stanford Cars [40], FGVC Aircraft [41], and Herbarium19 [42].
Dataset Splits	Yes	To separate categories into known and novel categories, we follow the SSB split protocol [43] for the fine-grained datasets. For CIFAR-100 and ImageNet-100, we perform a random category split using a fixed seed, consistent with previous studies. Table 5: Distribution of labeled (known) and unlabeled (novel) categories across datasets. Labeled categories have annotated training samples and serve as known categories for supervision, while unlabeled categories represent novel categories that the model must discover without labels. Dataset (labeled categories / unlabeled categories): CIFAR-100 [37] (80 / 20); ImageNet-100 [38] (50 / 50); CUB-200-2011 [39] (100 / 100); Stanford Cars [40] (98 / 98); FGVC-Aircraft [41] (50 / 50); Herbarium19 [42] (341 / 342).
Hardware Specification	Yes	All experiments are conducted on an NVIDIA 3090 GPU.
Software Dependencies	No	The paper mentions using a pre-trained DINO ViT-B/16 model as the backbone network and a GELU activation function, but it does not specify explicit versions for core software libraries like Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup	Yes	The learning rate is set to 0.1. Other hyperparameters, including batch size, temperature τ_s, weight decay λ, and the number of augmentations, are set to 128, 0.07, 1e-4, and 2, respectively, in accordance with previous studies. All experiments are conducted on an NVIDIA 3090 GPU. Further implementation details can be found in the Appendix. For training, we use the SGD optimizer with a momentum of 0.9 and weight decay of 1e-4. We apply a cosine annealing learning rate scheduler to adapt the learning rate during the course of training. The method is trained for a total of 200 epochs. Additionally, data augmentation and random cropping are applied to enhance the model's ability to generalize across different transformations. In our model, we combine supervised and unsupervised contrastive losses and the ETF alignment loss. The unsupervised ETF alignment uses a clustering method for periodic feature grouping, while the supervised ETF alignment ensures that labeled data is aligned with their corresponding ETF prototypes. During training, we use a weighted random sampler to balance labeled and unlabeled data in each batch, allowing the model to learn from both known and novel categories effectively.
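
Illustrative sketch (not the authors' code, which is not released): the Dataset Splits and Experiment Setup rows report a fixed-seed random category split and a cosine-annealed learning rate starting at 0.1 over 200 epochs. The Python below shows one plausible reading of those two settings. The seed value, the minimum learning rate of 0, and the function names `split_categories` and `cosine_annealed_lr` are assumptions introduced here; the source does not state them.

```python
import math
import random

# Values reported in the Experiment Setup row.
BASE_LR = 0.1
EPOCHS = 200
BATCH_SIZE = 128
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4

# Assumptions, not stated in the source: minimum learning rate and split seed.
ETA_MIN = 0.0
SPLIT_SEED = 0


def cosine_annealed_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, eta_min=ETA_MIN):
    """Closed-form cosine annealing from base_lr down to eta_min."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


def split_categories(num_classes, num_known, seed=SPLIT_SEED):
    """Reproducibly partition class indices into known and novel sets."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    classes = list(range(num_classes))
    rng.shuffle(classes)
    known = sorted(classes[:num_known])
    novel = sorted(classes[num_known:])
    return known, novel


if __name__ == "__main__":
    # CIFAR-100 row of Table 5: 80 labeled and 20 unlabeled categories.
    known, novel = split_categories(num_classes=100, num_known=80)
    print(len(known), len(novel))
    for epoch in (0, 50, 100, 150, 200):
        print(epoch, round(cosine_annealed_lr(epoch), 6))
```

Rerunning `split_categories` with the same seed returns the same partition, which is the property a fixed-seed protocol relies on. Reproducing the paper's exact split would still require the authors' seed and shuffling procedure.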