Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding
Authors: Peijun Bao, Yong Xia, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, Alex C. Kot
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with/without multi-modal inputs. |
| Researcher Affiliation | Academia | 1Nanyang Technological University 2Northwestern Polytechnical University 3Peng Cheng Laboratory peijun001@e.ntu.edu.sg, yxia@nwpu.edu.cn, yangwh@pcl.ac.cn, {ebpng, emher, eackot}@ntu.edu.sg |
| Pseudocode | No | The paper describes algorithms and models (e.g., 'Local-Global Multi-Modal Distillation (MMDist) explores leveraging multi-modal videos'), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about making source code publicly available, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We validate the performance of the proposed methods against the state-of-the-art approaches on two large-scale datasets: 1) Charades-STA (Gao et al. 2017) includes 9,848 videos of daily indoor activities... 2) ActivityNet Captions (Krishna et al. 2017) consists of 19,290 untrimmed videos... |
| Dataset Splits | No | The paper mentions using standard datasets (Charades-STA, ActivityNet Captions) which typically have predefined splits, but it does not explicitly provide details on training, validation, or test split percentages or sample counts within the text. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. It only mentions networks used for feature extraction. |
| Software Dependencies | No | The paper mentions several components like 'I3D network', 'C3D network', 'TV-L1 algorithm', 'GloVe word2vec', and 'Adam optimizer', but it does not provide specific version numbers for any of the software or libraries used (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | We set the maximum description length to 20 on both datasets. The vocabulary size is 8000 on ActivityNet Captions and 1111 on Charades-STA, respectively. We mask 1/3 of the words in the query sentence for reconstruction. The dimension of the hidden state d for both language and visual features is set to 256. The number of video snippets L is resampled to 200 on both datasets. We use the Adam optimizer (Kingma and Ba 2014) for model training with a batch size of 32. For multi-modal distillation, we first train the teacher model for 15 epochs with a learning rate of 0.00035, and then distill it to the student model for another 15 epochs with a learning rate of 0.0005. The training of the single-modal baseline is independent of the teacher/student models; its number of training epochs and learning rate are set to 15 and 0.0004, respectively. The hyperparameters α, β, and γ are set to 4.5, 0.9, and 3.0, respectively. (A hedged sketch of this training schedule follows the table.) |
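
For readers attempting to reproduce the reported schedule, below is a minimal PyTorch-style sketch of the two-stage teacher-student training summarized in the Experiment Setup row (Adam optimizer, batch size 32, teacher trained for 15 epochs at learning rate 0.00035, then distilled into the student for another 15 epochs at learning rate 0.0005). The model classes, loss criteria, and data loaders are hypothetical placeholders, not the authors' code; only the numeric hyperparameters are taken from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Values quoted from the paper's experiment setup; everything else
# (model classes, loss criteria, data loading) is a hypothetical placeholder.
BATCH_SIZE = 32                      # loaders are assumed to use batch_size=BATCH_SIZE
EPOCHS = 15
TEACHER_LR = 3.5e-4
STUDENT_LR = 5e-4
ALPHA, BETA, GAMMA = 4.5, 0.9, 3.0   # loss-weighting hyperparameters (α, β, γ),
                                     # assumed to weight terms inside the criteria below


def train_teacher(teacher: nn.Module, loader: DataLoader, criterion) -> nn.Module:
    """Stage 1: train the multi-modal teacher for 15 epochs with Adam."""
    opt = torch.optim.Adam(teacher.parameters(), lr=TEACHER_LR)
    teacher.train()
    for _ in range(EPOCHS):
        for batch in loader:
            loss = criterion(teacher(batch))        # hypothetical grounding objective
            opt.zero_grad()
            loss.backward()
            opt.step()
    return teacher


def distill_student(teacher: nn.Module, student: nn.Module,
                    loader: DataLoader, distill_criterion) -> nn.Module:
    """Stage 2: freeze the teacher and distill it into the single-modal student."""
    opt = torch.optim.Adam(student.parameters(), lr=STUDENT_LR)
    teacher.eval()
    student.train()
    for _ in range(EPOCHS):
        for batch in loader:
            with torch.no_grad():
                soft_targets = teacher(batch)       # teacher predictions as supervision
            loss = distill_criterion(student(batch), soft_targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

The single-modal baseline described in the paper would follow the same stage-1 loop but with a learning rate of 0.0004, trained independently of the teacher/student pair.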