STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
Authors: Jaewoo Lee, Jaehong Yoon, Wonjae Kim, Yunji Kim, Sung Ju Hwang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental validation on multiple benchmarks shows that our method achieves a 3.69%p relative performance gain in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by 45%. In Sec. 5, the authors experimentally validate the method in task-free continual audio-video pre-training, outlining the experimental setup (datasets, evaluation methods, metrics, and baselines) in Sec. 5.1 and presenting results and analysis in Sec. 5.2. |
| Researcher Affiliation | Collaboration | Jaewoo Lee 1 * Jaehong Yoon 2 * Wonjae Kim 3 Yunji Kim 3 Sung Ju Hwang 1 4 1KAIST 2UNC Chapel Hill 3NAVER AI Lab 4Deep Auto. |
| Pseudocode | Yes | Algorithm 1: Audio time chunk selection in a PyTorch-like style. Algorithm 2: Continual Pre-training of STELLA. Algorithm 3: Continual Pre-training of STELLA+. |
| Open Source Code | Yes | Our code is available at https://cl-stella.github.io/. |
| Open Datasets | Yes | We validate our method on continual audio-video pre-training over the VGGSound (Chen et al., 2020) and AudioSet (Gemmeke et al., 2017) datasets, consisting of 10s videos. For downstream tasks, we use two audiovisual datasets: MSR-VTT (Xu et al., 2016) and AVE (Tian et al., 2020). |
| Dataset Splits | No | The paper describes training and test sets (e.g., the MSR-VTT training and test sets yield 6k and 0.9k video clips, respectively), but does not explicitly define a separate validation split for model evaluation or hyperparameter tuning. |
| Hardware Specification | Yes | 4× A100 or 4× V100 GPUs |
| Software Dependencies | No | The paper mentions optimizers like 'Adam' and 'AdamW' and provides 'PyTorch-like pseudocode', but it does not specify version numbers for any software dependencies or libraries (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | Table 6: Audio-Video pre-training and fine-tuning hyperparameters. This table specifies Optimizer, Learning rate, Weight decay, Learning rate schedule, Warmup epochs, Epoch, Batch size, and various audio/video processing parameters (e.g., Audio Random Time Shifting yes/no, Audio Norm Mean/STD, Video Multi Scale Crop yes/no, Video Norm Mean/STD). |
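The Pseudocode row cites an audio time chunk selection routine written in a PyTorch-like style. As a rough illustration of what such a selection step could look like, here is a minimal sketch that picks the audio chunk most similar to a pooled video embedding via cosine similarity; the function name and signature are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def select_audio_chunk(audio_feats: torch.Tensor, video_feat: torch.Tensor) -> int:
    """Pick the audio time chunk whose embedding best matches the video.

    audio_feats: (num_chunks, dim) per-chunk audio embeddings
    video_feat:  (dim,) pooled video embedding
    Returns the index of the most similar chunk (cosine similarity).
    """
    a = F.normalize(audio_feats, dim=-1)   # unit-normalize each chunk embedding
    v = F.normalize(video_feat, dim=-1)    # unit-normalize the video embedding
    sims = a @ v                           # cosine similarity per chunk
    return int(sims.argmax())
```

This sketch only captures the generic idea of cross-modal chunk matching; the paper's Algorithm 1 should be consulted for the actual selection criterion.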