Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Action Dubber: Timing Audible Actions via Inflectional Flow

Authors: Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments confirm the effectiveness of our approach on Audible623 and show strong generalizability to other domains, such as repetitive counting and sound source localization."
Researcher Affiliation | Academia | "1 School of Computer Science and Engineering, South China University of Technology; 2 School of Computing and Information Systems, Singapore Management University; 3 School of Computing and Data Science, University of Hong Kong; 4 Department of Computer Science, City University of Hong Kong."
Pseudocode | No | The paper describes the method in text and figures (e.g., Figure 4, "Overview of TA2Net") but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and dataset are available at https://github.com/WenlongWan/Audible623."
Open Datasets | Yes | "To support this task, we introduce a new benchmark dataset, Audible623, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. ... Code and dataset are available at https://github.com/WenlongWan/Audible623."
Dataset Splits | Yes | "After collecting and annotating the action videos, we obtain a total of 623 videos and allocate 497 videos for training and 126 videos for evaluation."
Hardware Specification | Yes | "All experiments were conducted on a single NVIDIA A800 GPU with 80GB of memory, under Ubuntu 20.04 as the operating system."
Software Dependencies | Yes | "We implement our method on PyTorch with CUDA, version 1.13."
Experiment Setup | Yes | "During training, we randomly sample 64 frames per video, resizing them to 112x112 pixels. The model is trained using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 5e-6, a batch size of 4, and 20k iterations."
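The reported experiment setup (64 randomly sampled frames per video, resized to 112x112, Adam with a learning rate of 5e-6) can be sketched in PyTorch. This is a minimal illustration, not the authors' code: TA2Net itself is not reproduced here, so a small linear layer stands in as a hypothetical placeholder model, and the `sample_and_resize` helper is an assumption about how frame sampling might be done.

```python
import torch
import torch.nn.functional as F

def sample_and_resize(video: torch.Tensor, num_frames: int = 64,
                      size: int = 112) -> torch.Tensor:
    """Randomly sample `num_frames` frames from a (T, C, H, W) video
    tensor and resize each frame to (size, size)."""
    t = video.shape[0]
    idx, _ = torch.sort(torch.randint(0, t, (num_frames,)))
    frames = video[idx]  # (num_frames, C, H, W)
    return F.interpolate(frames, size=(size, size),
                         mode="bilinear", align_corners=False)

# Placeholder for TA2Net (hypothetical; the real model is in the repo).
model = torch.nn.Linear(8, 8)

# Reported hyperparameters: Adam, lr 5e-6, batch size 4, 20k iterations.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)
batch_size = 4

# One sampled training clip from a dummy 300-frame video.
clip = sample_and_resize(torch.randn(300, 3, 224, 224))
```

The sorted random indices keep the sampled frames in temporal order, which matters for a flow-based model even though the sampling itself is random.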