Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization
Authors: Yawen Cui, Li Liu, Zitong Yu, Guanjie Huang, Xiaopeng Hong
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This framework is validated in audiovisual classification tasks under the FS-AVCIL scenario, and extensive experiments demonstrate its superior performance. |
| Researcher Affiliation | Academia | 1) The Hong Kong Polytechnic University; 2) The Hong Kong University of Science and Technology (Guangzhou); 3) Great Bay University; 4) Harbin Institute of Technology |
| Pseudocode | No | The paper describes the methodology with equations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | AVE (Tian et al. 2018) dataset consists of events captured in 10-second video clips that feature both visual and auditory elements. It encompasses a total of 28 different event categories, with a collection of 4,143 videos. [...] Kinetics-Sounds is derived from the larger Kinetics-400 dataset (Kay et al. 2017). It consists of around 24,000 video clips, each with a duration of 10 seconds, and these clips are categorized into 32 classes of human actions. [...] We loaded a ViT-Base model pretrained on ImageNet-21K (Deng et al. 2009). |
| Dataset Splits | Yes | In our FS-AVCIL setting, we sample 8 categories as base classes perceived in the first session, and the remaining 20 classes are encountered in the following incremental learning sessions. We use all the training samples of the 8 classes in the first session, and the tasks of the incremental sessions are all under the few-shot learning scenario, i.e., each novel class only contains limited annotated samples. We adopt comprehensive configurations of the incremental sessions: (1) 20-way 5-shot, (2) 10-way 5-shot, (3) 5-way 5-shot, and (4) 2-way 5-shot. [...] We initially selected 12 categories as the base classes for the first learning session. The subsequent incremental learning sessions then introduce the remaining 20 classes. We use all training samples from the 12 base classes during the first session. We implement a variety of configurations for the incremental learning sessions to thoroughly evaluate our approach. These configurations include: (1) 20-way 5-shot, (2) 10-way 5-shot, (3) 5-way 5-shot, and (4) 2-way 5-shot. |
| Hardware Specification | Yes | Our framework was implemented using PyTorch, and all experiments were conducted on an NVIDIA A100 GPU. |
| Software Dependencies | No | Our framework was implemented using PyTorch, and all experiments were conducted on an NVIDIA A100 GPU. The paper mentions PyTorch but does not specify a version number or any other software dependencies with version information. |
| Experiment Setup | Yes | During training, we froze all the parameters of the 12 transformer blocks except for the Linear Projection of the audio branch and the MLP head. The dimension of tokens is 768, and the number of latent tokens is 2. The proposed TRP-AVA is parallel with the MLP layer and MHSA in each transformer block. Besides, the downsampling dimension in TRP-AVA is 8, and the temporal kernel size k in the 1D convolution operation is 5. For TPR, we use the temporal prompts of the last 6 transformer blocks, and the regularization loss weight λ is 0.3. Besides, we used the Adam optimizer, and the model was trained for 30 epochs with a batch size of 2 (i.e., 2 videos of 8 frames each) and a learning rate of 0.0003. |
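The dataset-split protocol quoted above (8 base classes on AVE, with the remaining 20 classes arriving in N-way 5-shot incremental sessions) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the use of class IDs rather than actual samples are assumptions.

```python
import random

def build_fs_avcil_sessions(num_classes=28, num_base=8, way=5, seed=0):
    """Sketch of an FS-AVCIL class split: a base session that uses all
    training samples of `num_base` classes, followed by N-way incremental
    sessions covering the remaining classes (class IDs only here)."""
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)
    base_classes = classes[:num_base]
    novel_classes = classes[num_base:]
    # Partition the remaining classes into N-way incremental sessions;
    # each novel class would contribute only 5 (shot) labeled samples.
    sessions = [novel_classes[i:i + way] for i in range(0, len(novel_classes), way)]
    return base_classes, sessions

base, sessions = build_fs_avcil_sessions()
print(len(base), [len(s) for s in sessions])  # 8 [5, 5, 5, 5]
```

With the 5-way configuration on AVE's 28 classes, this yields one base session of 8 classes and four incremental sessions of 5 classes each; the 10-way and 2-way configurations simply change `way`.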
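The experiment-setup excerpt (frozen transformer blocks, trainable audio-branch Linear Projection and MLP head, Adam at lr 0.0003) can be sketched in PyTorch. This is a hedged stand-in: the module names (`blocks`, `audio_proj`, `mlp_head`) and the use of `nn.TransformerEncoderLayer` are assumptions, not the paper's actual architecture, which also includes the TRP-AVA and TPR components.

```python
import torch
from torch import nn

class TinyAVModel(nn.Module):
    """Hypothetical stand-in for a 12-block audio-visual transformer."""
    def __init__(self, dim=768, num_blocks=12, num_classes=28):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
            for _ in range(num_blocks)
        )
        self.audio_proj = nn.Linear(dim, dim)        # audio-branch Linear Projection
        self.mlp_head = nn.Linear(dim, num_classes)  # classification head

model = TinyAVModel()

# Freeze all transformer-block parameters; only the audio projection
# and the MLP head remain trainable, as described in the excerpt.
for p in model.blocks.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=3e-4)  # lr 0.0003, as quoted
```

Passing only the `requires_grad` parameters to the optimizer mirrors the paper's parameter-efficient setup, where the backbone stays fixed across incremental sessions.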