MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Authors: Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate MoTE achieves an optimal trade-off between zero-shot and close-set performance with one unified model. Thorough ablation studies show the scalability and effectiveness of our proposed method (§4).
Researcher Affiliation | Academia | Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen (Tongji University, Shanghai, China)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/ZMHH-H/MoTE.
Open Datasets | Yes | We fine-tune our model using the Kinetics-400 [15] dataset as in previous works [28]... Zero-shot: Following previous works [28, 34], we evaluate zero-shot performance on UCF-101 [38], HMDB-51 [19], and Kinetics-600 [3]. (A generic sketch of this zero-shot protocol follows the table.)
Dataset Splits | Yes | Kinetics-400 [15] is a large-scale dataset in the video domain. The dataset contains 240k training videos and 20k validation videos in 400 human action categories... UCF-101 [38]: There are three official splits of training and validation data. HMDB-51 [19]: There are three official splits of the dataset, each with 3,570 training videos and 1,530 validation videos.
Hardware Specification | Yes | We conduct experiments with 3 NVIDIA GeForce RTX 4090 GPUs.
Software Dependencies | No | The paper lists 'AdamW' as the optimizer but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | In Table 6, we present the hyper-parameters set for optimization: batch size 144; optimizer AdamW; weight decay 0.2; Adam β1, β2 = 0.9, 0.999; learning rate (base) 5e-5; learning rate (CLIP layers) 3e-6; learning rate decay: cosine schedule; training epochs: 30 (ViT-B), 20 (ViT-L); linear warm-up epochs: 5. (A PyTorch sketch of this setup follows the table.)
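
The reported Experiment Setup maps onto a standard PyTorch recipe. Below is a minimal sketch under assumptions: the "clip" parameter-name prefix used to split the two learning-rate groups and the per-epoch scheduler stepping are hypothetical conventions, not the authors' released code; the learning rates, betas, weight decay, warm-up, and epoch count are the reported ViT-B values (batch size 144 is global across the 3 GPUs).

```python
# Minimal sketch of the reported optimization recipe (Table 6).
# The "clip" name prefix for pretrained-backbone parameters is an
# assumed convention; numeric defaults are the reported ViT-B settings.
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, base_lr=5e-5, clip_lr=3e-6, weight_decay=0.2):
    """AdamW with two groups: pretrained CLIP layers vs. newly added modules."""
    clip_params, base_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (clip_params if name.startswith("clip") else base_params).append(param)
    return AdamW(
        [{"params": base_params, "lr": base_lr},
         {"params": clip_params, "lr": clip_lr}],
        betas=(0.9, 0.999),
        weight_decay=weight_decay,
    )

def build_scheduler(optimizer, warmup_epochs=5, total_epochs=30):
    """Linear warm-up, then cosine decay; stepped once per epoch."""
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)
```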
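
The zero-shot evaluations on UCF-101, HMDB-51, and Kinetics-600 cited in the Open Datasets row follow the usual CLIP-style protocol: encode one text prompt per class name and classify each video by cosine similarity to the prompt embeddings. A generic sketch, assuming placeholder `video_encoder` / `text_encoder` callables and a hypothetical prompt template (not the paper's exact setup):

```python
# Generic CLIP-style zero-shot evaluation sketch; the encoders and the
# prompt template are placeholders, not the paper's exact protocol.
import torch

@torch.no_grad()
def zero_shot_accuracy(video_encoder, text_encoder, loader, class_names):
    # One prompt per class, e.g. "a video of a person riding a bike".
    prompts = [f"a video of a person {name}" for name in class_names]
    text_feats = text_encoder(prompts)                        # (C, D)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    correct = total = 0
    for videos, labels in loader:
        vid_feats = video_encoder(videos)                     # (B, D)
        vid_feats = vid_feats / vid_feats.norm(dim=-1, keepdim=True)
        logits = vid_feats @ text_feats.t()                   # cosine similarity
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```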