Reproducibility Index

Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline that was validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery

Authors: Zhenqi He, Yuanpei Liu, Kai Han

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. ... Through extensive experimentation on public GCD benchmarks, SEAL consistently demonstrates its effectiveness and achieves superior performance, especially on fine-grained datasets.
Researcher Affiliation	Academia	Zhenqi He* Yuanpei Liu* Kai Han Visual AI Lab, The University of Hong Kong
Pseudocode	Yes	Algorithm 1: Dynamic Update of Mh
Open Source Code	No	We will release the codes and guidelines for reproducing the results after acceptance.
Open Datasets	Yes	We conduct a comprehensive evaluation of our method across a variety of benchmarks. The main paper reports results on the Semantic Shift Benchmark (SSB) [58], which covers fine-grained datasets (CUB [60], Stanford Cars [34], and FGVC-Aircraft [42]), plus Oxford-Pet [46] and the more challenging Herbarium19 [55]. ... For all datasets, we follow the class split protocol of [57]
Dataset Splits	Yes	For all datasets, we follow the class split protocol of [57], where a subset of classes is selected as the known ("Old") label set Y_l. From these known classes, 50% of the samples are used to construct the labelled set D_l, and the remaining images with instances from novel classes form the unlabelled set D_u. ... For CIFAR-100, 80% of the classes are designated as Old classes, while the remaining 20% as New classes. ... the model's hyperparameters are chosen based on its performance on a hold-out validation set, formed by the original test splits of labelled classes in each dataset.
Hardware Specification	Yes	All experiments are performed on a single NVIDIA L40S GPU with 24GB of memory.
Software Dependencies	No	All experiments utilize the PyTorch framework on a workstation with Nvidia L40s GPUs.
Experiment Setup	Yes	The model is trained for 200 epochs using a batch size of 128 and a cosine learning rate schedule, starting from an initial learning rate of 10^-1 and decaying to 10^-4. ... We perform hyperparameter tuning using a held-out validation split from the labelled data. Specifically, we tune the consistency temperature τ_c and the soft negative controller λ_s based on their performance on the Stanford Cars [34] dataset. ... optimal performance achieved when τ_c = 0.75 and λ_s = 1.0.
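
The Experiment Setup row reports a cosine learning rate schedule over 200 epochs, starting at 0.1 and decaying to 0.0001. Since the code has not been released, the following is only a minimal sketch of what such a schedule could look like. The function name `cosine_lr`, the per-epoch (rather than per-iteration) update, and the assumption that the minimum is reached exactly at the final epoch are illustrative choices, not details confirmed by the paper.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_max=1e-1, lr_min=1e-4):
    """Cosine-annealed learning rate for a 0-indexed epoch.

    Hypothetical helper: assumes the rate is updated once per epoch and
    reaches lr_min at the last epoch (epoch == total_epochs - 1).
    """
    # Fraction of training completed, from 0.0 (first epoch) to 1.0 (last epoch).
    progress = epoch / (total_epochs - 1)
    # Half-cosine interpolation from lr_max down to lr_min.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at a few points of the 200-epoch run.
    for epoch in (0, 50, 100, 150, 199):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

In PyTorch, a comparable per-epoch schedule is typically obtained with `torch.optim.lr_scheduler.CosineAnnealingLR` by setting `eta_min` to the final rate. Whether the authors used that scheduler is not stated in the quoted text.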