Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation

Authors: Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings.
Researcher Affiliation	Academia	Jingmin Zhu, Monash University; Anqi Zhu, Monash University; Hossein Rahmani, Lancaster University; Jun Liu, Lancaster University; Mohammed Bennamoun, University of Western Australia; Qiuhong Ke, Monash University
Pseudocode	Yes	This section provides a formal description of the Skeleton-Cache algorithm through pseudocode and clarifies the notation used throughout our paper. Algorithm 1 Skeleton-Cache: Training-Free Test-Time Adaptation for SZAR
Open Source Code	Yes	The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
Open Datasets	Yes	Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. NTU RGB+D 60 Dataset [20]. NTU RGB+D 120 Dataset [16]. PKU-MMD Dataset [4].
Dataset Splits	Yes	Our method follows the settings of existing baseline methods, adopting the split settings as [8], with all model training conducted on the seen train dataset. Due to fixed splits potentially not reflecting method performance well, recent papers (SMIE [32], SA-DVAE [15], etc.) have established a new setting where they implement three different random class splits of 55/5, 110/10, and 46/5 for NTU60, NTU-120, and PKU-MMD datasets respectively. Our paper has also completed this baseline, with results shown in Table 2. Table header: NTU RGB+D 60 (55/5 Split, 48/12 Split); NTU RGB+D 120 (110/10 Split, 96/24 Split).
Hardware Specification	Yes	All experiments run on a single NVIDIA RTX 4090 GPU.
Software Dependencies	No	The paper mentions GPT-4o and gpt-4-turbo as LLM APIs used for generating weights, indicating specific models of the LLM. However, it does not provide version numbers for any other key software components or libraries (e.g., Python, PyTorch, CUDA, NumPy) that would be typically required to replicate the experiments.
Experiment Setup	Yes	The cache size is fixed to K = 8 prototypes per unseen class; the body-part granularity is P = 4 (head, torso, arms, legs) and the temporal segmentation is Z = 3 (begin, middle, end). We use a test batch size of 1 to emulate streaming deployment. The balancing coefficient αs (Fig. 2b) shows optimal performance around 5.0, balancing the influence of cache-retrieved logits against the original zero-shot predictions. The temperature parameter β (Fig. 2c) reaches peak performance at 3.0, providing the ideal sharpness for the similarity distribution used in descriptor comparison.
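
For readers who want to record the reported settings in a script, the sketch below collects the hyperparameters from the Experiment Setup row into one config object and shows one way a balancing coefficient and a temperature could combine cache-retrieved logits with zero-shot logits. The fusion rule is an assumption borrowed from Tip-Adapter-style cache models (affinity exp(-β(1 - cosine similarity)), scaled by α and added to the zero-shot logits); the row above does not state the paper's exact formula. The names `SkeletonCacheConfig`, `cosine`, and `fuse_logits` are hypothetical and do not come from the authors' repository.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SkeletonCacheConfig:
    """Values reported in the Experiment Setup row; field names are hypothetical."""
    cache_size_per_class: int = 8   # K prototypes per unseen class
    num_body_parts: int = 4         # P: head, torso, arms, legs
    num_temporal_segments: int = 3  # Z: begin, middle, end
    test_batch_size: int = 1        # emulates streaming deployment
    alpha: float = 5.0              # balancing coefficient (best around 5.0)
    beta: float = 3.0               # temperature (peak reported at 3.0)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fuse_logits(zero_shot_logits, query, cache, cfg):
    """Add alpha-weighted cache affinities to the zero-shot logits.

    ASSUMPTION: Tip-Adapter-style fusion, not confirmed as the paper's rule.
    `cache` maps a class index to a list of stored descriptor vectors.
    """
    fused = list(zero_shot_logits)
    for cls, prototypes in cache.items():
        # Use at most K prototypes per class.
        kept = prototypes[: cfg.cache_size_per_class]
        # Larger beta sharpens the affinity: only very similar entries contribute.
        affinity = sum(
            math.exp(-cfg.beta * (1.0 - cosine(query, p))) for p in kept
        )
        fused[cls] += cfg.alpha * affinity
    return fused
```

Under this assumed rule, an empty cache leaves the zero-shot logits unchanged, while a cached descriptor identical to the query adds exactly α to its class logit. Setting α to 0 recovers the pure zero-shot prediction.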