Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Reliable and Diverse Hierarchical Adapter for Zero-shot Video Classification
Authors: Wenxuan Ge, Peng Huang, Rui Yan, Hongyu Qu, Guosen Xie, Xiangbo Shu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter. [...] Extensive experiments over four benchmarks demonstrate that the reliable and diverse hierarchical adapter achieves superior performance while maintaining competitive computational efficiency. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology |
| Pseudocode | Yes | For clarity, we provide the whole cache update process in Algorithm 1 in the form of pseudo-code. |
| Open Source Code | Yes | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter. |
| Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] is a small-scale action recognition dataset. [...] UCF-101 [Soomro, 2012] consists of 13,320 videos covering 101 categories, which can be further grouped into five main categories: Body motion, Human-human interactions, Human-object interactions, Playing instruments, and Sports. Kinetics-600 [Carreira et al., 2018] is a large-scale video dataset, containing 600 human action classes, with at least 600 video clips for each action. [...] ActivityNet-200 [Fabian Caba Heilbron and Niebles, 2015] is also a large-scale action recognition benchmark. |
| Dataset Splits | No | The paper mentions evaluating on specific datasets (HMDB-51, UCF-101, Kinetics-600, ActivityNet-200) and using a validation set for hyperparameter search on Kinetics-400, but does not explicitly provide the training/test/validation split percentages or sample counts for any of these datasets in the main text. While these are standard benchmarks, the specific splits used are not detailed. |
| Hardware Specification | Yes | All the experiments are conducted using a single NVIDIA 3090 24GB GPU. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | We utilize a pre-trained ViT-B/16 of CLIP as the foundation model, and the model is not fine-tuned on extra large video datasets. In test-time adaption, we sample T = 32 frames from each test video. We use top-1 accuracy (%) as our evaluation metric. We perform a search for hyperparameter on the validation set of Kinetics-400. In FCR, we select 8 frames based on prediction entropy, and subsequently select 5 frames based on TPD to construct refined video embeddings. When calculating TPD, each frame is divided into 7 × 7 image patches, and temporal shuffling is applied between adjacent 2 frames. In Algorithm 1, cache size n is set as 10 and similarity threshold τ is 0.95. In Eq. 2, β is 8 according to TDA, and in Eq. 7, µ is set to 0.5. |
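
The values quoted in the Experiment Setup row can be collected into a small configuration object, and the first frame-selection step (keeping 8 of the 32 sampled frames by prediction entropy) can be illustrated with a short sketch. This is not the authors' implementation; their code is at the repository linked above. The class name `SetupConfig`, its field names, and the function `select_low_entropy_frames` are hypothetical, and the sketch assumes that "based on prediction entropy" means keeping the frames whose class distribution has the lowest entropy. The TPD-based selection of 5 frames is not reproduced here because the summary does not define TPD.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_sampled_frames: int = 32   # T, frames sampled per test video
    entropy_keep: int = 8          # frames kept by prediction entropy
    tpd_keep: int = 5              # frames kept by TPD (not implemented here)
    patch_grid: int = 7            # each frame is split into 7 x 7 patches
    cache_size: int = 10           # n in Algorithm 1
    sim_threshold: float = 0.95    # tau in Algorithm 1
    beta: float = 8.0              # beta in Eq. 2
    mu: float = 0.5                # mu in Eq. 7


def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_low_entropy_frames(frame_logits, keep):
    """Return indices of the `keep` lowest-entropy frames, in temporal order.

    Assumption: lower entropy is treated as a more confident frame.
    """
    scored = [(entropy(softmax(logits)), i) for i, logits in enumerate(frame_logits)]
    chosen = sorted(scored)[:keep]
    return sorted(i for _, i in chosen)


if __name__ == "__main__":
    cfg = SetupConfig()
    # Toy logits for 4 frames and 3 classes; real inputs would have
    # cfg.num_sampled_frames rows and cfg.entropy_keep would be passed as `keep`.
    toy_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
    print(select_low_entropy_frames(toy_logits, keep=2))
```

In the toy example, the uniform frame (index 0) has the highest entropy and is dropped first, while the two sharply peaked frames are kept. Returning the indices in temporal order is a design choice of this sketch, made so that any later step depending on frame order sees the frames as they occurred.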