Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
Authors: Jaewoo Lee, Jaehong Yoon, Wonjae Kim, Yunji Kim, Sung Ju Hwang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental validation on multiple benchmarks shows that our method achieves a 3.69%p of relative performance gain in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by 45%. Our code is available at https://cl-stella.github.io/. In this section, we experimentally validate the effectiveness of our method in task-free continual audio-video pre-training. We start by outlining our experimental setup in Sec. 5.1, covering datasets, evaluation methods, evaluation metrics, and baseline methods employed for our experiments. Subsequently, we present the experimental results and conduct a comprehensive analysis in Sec. 5.2. |
| Researcher Affiliation | Collaboration | Jaewoo Lee¹*, Jaehong Yoon²*, Wonjae Kim³, Yunji Kim³, Sung Ju Hwang¹,⁴; ¹KAIST, ²UNC Chapel Hill, ³NAVER AI Lab, ⁴Deep Auto. |
| Pseudocode | Yes | Algorithm 1: Audio time chunk selection in a PyTorch-like style. Algorithm 2: Continual Pre-training of STELLA. Algorithm 3: Continual Pre-training of STELLA+. |
| Open Source Code | Yes | Our code is available at https://cl-stella.github.io/. |
| Open Datasets | Yes | We validate our method on continual audio-video pre-training over VGGSound (Chen et al., 2020) and AudioSet (Gemmeke et al., 2017) datasets, consisting of 10s videos. For downstream tasks, we use two audiovisual datasets: MSR-VTT (Xu et al., 2016) and AVE (Tian et al., 2020). |
| Dataset Splits | No | The paper describes training and test sets (e.g., MSR-VTT training dataset and test dataset yielding 6k and 0.9k video clips respectively), but does not explicitly define a separate validation split from the main datasets or tasks for model evaluation or hyperparameter tuning. |
| Hardware Specification | Yes | GPUs: 4 A100 or 4 V100 |
| Software Dependencies | No | The paper mentions optimizers like 'Adam' and 'AdamW' and provides 'PyTorch-like pseudo code', but it does not specify version numbers for any software dependencies or libraries (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | Table 6: Audio-Video pre-training and fine-tuning hyperparameters. This table specifies Optimizer, Learning rate, Weight decay, Learning rate schedule, Warmup epochs, Epoch, Batch size, and various audio/video processing parameters (e.g., Audio Random Time Shifting yes/no, Audio Norm Mean/STD, Video Multi Scale Crop yes/no, Video Norm Mean/STD). |