Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment

Authors: Jaewoo Lee, Jaehong Yoon, Wonjae Kim, Yunji Kim, Sung Ju Hwang

ICML 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental validation on multiple benchmarks shows that our method achieves a 3.69%p of relative performance gain in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by 45%. Our code is available at https://cl-stella.github.io/. In this section, we experimentally validate the effectiveness of our method in task-free continual audio-video pre-training. We start by outlining our experimental setup in Sec. 5.1, covering datasets, evaluation methods, evaluation metrics, and baseline methods employed for our experiments. Subsequently, we present the experimental results and conduct a comprehensive analysis in Sec. 5.2.
Researcher Affiliation | Collaboration | Jaewoo Lee 1*, Jaehong Yoon 2*, Wonjae Kim 3, Yunji Kim 3, Sung Ju Hwang 1,4; 1 KAIST, 2 UNC Chapel Hill, 3 NAVER AI Lab, 4 DeepAuto.
Pseudocode | Yes | Algorithm 1: Audio time chunk selection in a PyTorch-like style. Algorithm 2: Continual Pre-training of STELLA. Algorithm 3: Continual Pre-training of STELLA+.
Open Source Code | Yes | Our code is available at https://cl-stella.github.io/.
Open Datasets | Yes | We validate our method on continual audio-video pre-training over VGGSound (Chen et al., 2020) and AudioSet (Gemmeke et al., 2017) datasets, consisting of 10s videos. For downstream tasks, we use two audiovisual datasets: MSR-VTT (Xu et al., 2016) and AVE (Tian et al., 2020).
Dataset Splits | No | The paper describes training and test sets (e.g., MSR-VTT training dataset and test dataset yielding 6k and 0.9k video clips respectively), but does not explicitly define a separate validation split from the main datasets or tasks for model evaluation or hyperparameter tuning.
Hardware Specification | Yes | GPUs: 4 A100 or 4 V100
Software Dependencies | No | The paper mentions optimizers like 'Adam' and 'AdamW' and provides 'PyTorch-like pseudo code', but it does not specify version numbers for any software dependencies or libraries (e.g., PyTorch version, CUDA version).
Experiment Setup | Yes | Table 6: Audio-Video pre-training and fine-tuning hyperparameters. This table specifies Optimizer, Learning rate, Weight decay, Learning rate schedule, Warmup epochs, Epoch, Batch size, and various audio/video processing parameters (e.g., Audio Random Time Shifting yes/no, Audio Norm Mean/STD, Video Multi Scale Crop yes/no, Video Norm Mean/STD).