Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Authors: Peijun Bao, Yong Xia, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, Alex C. Kot

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with/without multi-modal inputs.
Researcher Affiliation | Academia | Nanyang Technological University; Northwestern Polytechnical University; Peng Cheng Laboratory. peijun001@e.ntu.edu.sg, yxia@nwpu.edu.cn, yangwh@pcl.ac.cn, {ebpng, emher, eackot}@ntu.edu.sg
Pseudocode | No | The paper describes algorithms and models (e.g., 'Local-Global Multi-Modal Distillation (MMDist) explores leveraging multi-modal videos'), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about making the source code publicly available, nor does it provide any links to a code repository.
Open Datasets | Yes | We validate the performance of the proposed methods against the state-of-the-art approaches on two large-scale datasets: 1) Charades-STA (Gao et al. 2017) includes 9,848 videos of daily indoor activities... 2) ActivityNet Captions (Krishna et al. 2017) consists of 19,290 untrimmed videos...
Dataset Splits | No | The paper mentions using standard datasets (Charades-STA, ActivityNet Captions) which typically have predefined splits, but it does not explicitly provide details on training, validation, or test split percentages or sample counts within the text.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. It only mentions the networks used for feature extraction.
Software Dependencies | No | The paper mentions several components such as the 'I3D network', 'C3D network', 'TV-L1 algorithm', 'GloVe word2vec', and 'Adam optimizer', but it does not provide specific version numbers for any of the software or libraries used (e.g., Python, PyTorch, TensorFlow, etc.).
Experiment Setup | Yes | We set the maximum description length to 20 on both datasets. The vocabulary size is 8000 on ActivityNet Captions and 1111 on Charades-STA, respectively. We mask 1/3 of the words in the query sentence for reconstruction. The dimension of the hidden state d for both language and visual features is set to 256. The number of video snippets L is resampled to 200 on both datasets. We use the Adam optimizer (Kingma and Ba 2014) for model training with a batch size of 32. For multi-modal distillation, we first train the teacher model for 15 epochs with a learning rate of 0.00035, and then distill it to the student model for another 15 epochs with a learning rate of 0.0005. The training of the single-modal baseline is independent of the teacher/student models; its number of training epochs and learning rate are set to 15 and 0.0004, respectively. The hyperparameters α, β, and γ are set to 4.5, 0.9, and 3.0, respectively.
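The quoted setup amounts to a two-stage recipe: train a multi-modal teacher, then distill it into a student. Below is a minimal PyTorch-style sketch of that schedule, under stated assumptions: `TrainConfig`, `grounding_loss`, the model interfaces, and the KL-based distillation term are hypothetical placeholders for components the quoted text does not specify (including which loss term each of α, β, γ weights); only the numeric hyperparameters are taken from the paper text.

```python
# Hedged sketch of the reported two-stage training schedule (not the authors' code).
# Only the numbers (batch size 32, Adam, 15 + 15 epochs, lrs 0.00035 / 0.0005,
# d = 256, L = 200 snippets, alpha/beta/gamma = 4.5 / 0.9 / 3.0) come from the paper.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class TrainConfig:
    max_desc_len: int = 20      # maximum query description length
    hidden_dim: int = 256       # hidden state dimension d
    num_snippets: int = 200     # videos resampled to L = 200 snippets
    batch_size: int = 32
    teacher_epochs: int = 15
    teacher_lr: float = 0.00035
    student_epochs: int = 15
    student_lr: float = 0.0005
    alpha: float = 4.5          # loss weights; mapping to specific terms is assumed
    beta: float = 0.9
    gamma: float = 3.0


def grounding_loss(scores: torch.Tensor) -> torch.Tensor:
    # Placeholder for the (unspecified here) weakly-supervised objective,
    # e.g. masked query-word reconstruction; returns a scalar loss.
    return scores.pow(2).mean()


def train_multimodal_distillation(teacher, student, loader, cfg: TrainConfig):
    # Stage 1: train the multi-modal teacher.
    opt_t = torch.optim.Adam(teacher.parameters(), lr=cfg.teacher_lr)
    for _ in range(cfg.teacher_epochs):
        for batch in loader:
            loss = grounding_loss(teacher(batch))
            opt_t.zero_grad()
            loss.backward()
            opt_t.step()

    # Stage 2: distill the frozen teacher into the (single-modal) student.
    teacher.eval()
    opt_s = torch.optim.Adam(student.parameters(), lr=cfg.student_lr)
    for _ in range(cfg.student_epochs):
        for batch in loader:
            with torch.no_grad():
                t_scores = teacher(batch)          # teacher snippet-level scores
            s_scores = student(batch)
            # KL divergence between student and teacher predictions is a common
            # distillation objective; the paper's own formulation may differ.
            distill = F.kl_div(
                F.log_softmax(s_scores, dim=-1),
                F.softmax(t_scores, dim=-1),
                reduction="batchmean",
            )
            loss = grounding_loss(s_scores) + cfg.alpha * distill
            opt_s.zero_grad()
            loss.backward()
            opt_s.step()
```

The two Adam optimizers with separate learning rates mirror the separately trained teacher and student stages described in the quoted setup; the single-modal baseline mentioned there would be trained independently with its own 15 epochs at a learning rate of 0.0004.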