Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
Authors: Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. |
| Researcher Affiliation | Academia | Jingmin Zhu (Monash University); Anqi Zhu (Monash University); Hossein Rahmani (Lancaster University); Jun Liu (Lancaster University); Mohammed Bennamoun (University of Western Australia); Qiuhong Ke (Monash University) |
| Pseudocode | Yes | This section provides a formal description of the Skeleton-Cache algorithm through pseudocode and clarifies the notation used throughout our paper. Algorithm 1 Skeleton-Cache: Training-Free Test-Time Adaptation for SZAR |
| Open Source Code | Yes | The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache. |
| Open Datasets | Yes | Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. NTU RGB+D 60 Dataset [20]. NTU RGB+D 120 Dataset [16]. PKU-MMD Dataset [4]. |
| Dataset Splits | Yes | Our method follows the settings of existing baseline methods, adopting the split settings as [8], with all model training conducted on the seen train dataset. Due to fixed splits potentially not reflecting method performance well, recent papers (SMIE [32], SA-DVAE [15], etc.) have established a new setting where they implement three different random class splits of 55/5, 110/10, and 46/5 for NTU60, NTU-120, and PKU-MMD datasets respectively. Our paper has also completed this baseline, with results shown in Table 2. Table column headers: NTU RGB+D 60 (55/5 Split, 48/12 Split); NTU RGB+D 120 (110/10 Split, 96/24 Split). |
| Hardware Specification | Yes | All experiments run on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions GPT-4o and gpt-4-turbo as LLM APIs used for generating weights, indicating specific models of the LLM. However, it does not provide version numbers for any other key software components or libraries (e.g., Python, PyTorch, CUDA, NumPy) that would be typically required to replicate the experiments. |
| Experiment Setup | Yes | The cache size is fixed to K = 8 prototypes per unseen class; the body-part granularity is P = 4 (head, torso, arms, legs) and the temporal segmentation is Z = 3 (begin, middle, end). We use a test batch size of 1 to emulate streaming deployment. The balancing coefficient ιs (Fig. 2b) shows optimal performance around 5.0, balancing the influence of cache-retrieved logits against the original zero-shot predictions. The temperature parameter β (Fig. 2c) reaches peak performance at 3.0, providing the ideal sharpness for the similarity distribution used in descriptor comparison. |
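
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when attempting a replication. The sketch below is illustrative only: the class name, field names, and the `fuse_logits` helper are hypothetical, and the fusion rule (an exponential similarity kernel scaled by the balancing coefficient and added to the zero-shot logits, in the style of Tip-Adapter) is an assumption, not the formula from the paper or its repository.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SkeletonCacheConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    prototypes_per_class: int = 8                            # K = 8 per unseen class
    body_parts: tuple = ("head", "torso", "arms", "legs")    # P = 4
    temporal_segments: tuple = ("begin", "middle", "end")    # Z = 3
    test_batch_size: int = 1                                 # emulates streaming deployment
    balance: float = 5.0                                     # balancing coefficient (Fig. 2b)
    beta: float = 3.0                                        # temperature (Fig. 2c)


def fuse_logits(zero_shot_logits, cache_similarities, cfg):
    """ASSUMED fusion rule (Tip-Adapter style), not confirmed by the source.

    zero_shot_logits[c]   : zero-shot score for unseen class c
    cache_similarities[c] : cosine similarities between the test descriptor
                            and the cached prototypes of class c (may be empty)
    """
    fused = []
    for logit, sims in zip(zero_shot_logits, cache_similarities):
        # Larger beta sharpens the kernel, so only close prototypes contribute.
        cache_logit = sum(math.exp(-cfg.beta * (1.0 - s)) for s in sims)
        fused.append(logit + cfg.balance * cache_logit)
    return fused


cfg = SkeletonCacheConfig()
# Class 0 has one perfectly matching cached prototype; class 1 has an empty cache.
fused = fuse_logits([0.2, 0.1], [[1.0], []], cfg)
```

With an empty cache a class keeps its original zero-shot score, so under this assumed rule the adaptation only departs from the backbone's predictions once prototypes have been stored.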