Low-Fidelity Video Encoder Optimization for Temporal Action Localization

Authors: Mengmeng Xu, Juan-Manuel Pérez-Rúa, Xiatian Zhu, Bernard Ghanem, Brais Martinez

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed LoFi optimization approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18-based video encoder in a single RGB stream, our method surpasses two-stream (RGB + optical flow) ResNet50-based alternatives, often by a good margin. Our code is publicly available at https://github.com/saic-fi/lofi_action_localization.
Researcher Affiliation | Collaboration | Mengmeng Xu (1,2) mengmeng.xu@kaust.edu.sa; Juan-Manuel Pérez-Rúa (1) PerezRua.JM@gmail.com; Xiatian Zhu (1) xiatian.zhu@samsung.com; Bernard Ghanem (2) bernard.ghanem@kaust.edu.sa; Brais Martinez (1) brais.a@samsung.com. (1) Samsung AI Centre Cambridge, UK; (2) King Abdullah University of Science and Technology, Saudi Arabia.
Pseudocode | No | The paper describes the proposed method and training procedure in detailed textual descriptions and flow diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our code is publicly available at https://github.com/saic-fi/lofi_action_localization.
Open Datasets | Yes | Datasets: We use Kinetics400 [28] as the auxiliary video classification dataset for initial pretraining of the video encoder. For model performance evaluation, we use two popular temporal action localization benchmarks. (1) ActivityNet-v1.3 [22] contains 20K temporally annotated untrimmed videos with 200 action categories. (2) Human Action Clips and Segments (HACS-v1.1) [68] is a recent temporal action localization dataset.
Dataset Splits | Yes | In the standard evaluation protocol, these videos are divided into the training/validation/testing sets by the ratio of 2:1:1. (See the split sketch after the table.)
Hardware Specification | Yes | Hardware and software settings: We implemented our method using PyTorch 1.8 with CUDA 10.1. For LoFi training, we use 4 NVIDIA V100 GPUs, each with 32GB memory. Under this setting, the memory constraint is 128GB, which constitutes a mid-range computational budget. In the supplementary material, we further test a low-budget setting with a single V100 GPU in Table B.
Software Dependencies | Yes | Hardware and software settings: We implemented our method using PyTorch 1.8 with CUDA 10.1.
Experiment Setup | Yes | Implementation details: We use ResNet-based TSM [33] as the video encoder due to its good accuracy-cost trade-off and reasonable memory requirements compared to 3D-based alternatives. For the full-fidelity setting (Eq. (1)), we follow the standard G-TAD protocol and represent each video with L = 100 snippets. The full spatial resolution is H × W = 224 × 224. We keep the other hyper-parameters (e.g., the number of GCNeXt layers) the same as in the default G-TAD configuration. However, the number of anchor proposals can be reduced when L is smaller. Concretely, we enumerate all the possible combinations of start and end as the anchors, e.g., {(t_s, t_e) | 0 < t_s < t_e < L; t_s, t_e ∈ ℕ; t_e − t_s < L}. For LoFi training, we use an SGD optimizer. The batch size is 16 for all the training methods and input patterns. The weight decay is 10^-4 and we set the momentum to 0, which is standard for fine-tuning [32]. The learning rate is 0.1, and it is decayed by 0.5 after every 5 epochs. (See the training-setup sketch after the table.)
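The 2:1:1 protocol quoted in the Dataset Splits row can be illustrated with a short partitioning helper. This is a minimal sketch under assumed conventions: the function name split_2_1_1, the seeded shuffle, and the synthetic video IDs are hypothetical, and the benchmarks distribute their own official split files.

```python
# Minimal sketch of a 2:1:1 train/validation/test partition over video IDs.
# The helper name, the seeded shuffle, and the synthetic IDs are assumptions;
# ActivityNet-v1.3 and HACS-v1.1 ship their own official split files.
import random


def split_2_1_1(video_ids, seed=0):
    """Partition video IDs into train/val/test at a 2:1:1 ratio."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = n // 2, n // 4
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


if __name__ == "__main__":
    train, val, test = split_2_1_1([f"v_{i:05d}" for i in range(20000)])
    print(len(train), len(val), len(test))  # -> 10000 5000 5000
```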
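The anchor enumeration and optimizer settings quoted in the Experiment Setup row map fairly directly onto PyTorch. The sketch below is not the authors' released code: enumerate_anchors and build_optimizer are hypothetical helper names and the model stub is a placeholder; only the hyper-parameter values (L = 100 snippets, batch size 16, weight decay 10^-4, momentum 0, learning rate 0.1 halved every 5 epochs) and the anchor condition come from the quoted text.

```python
# Illustrative sketch of the quoted anchor enumeration and LoFi fine-tuning
# optimizer. Helper names and the model stub are assumptions; the
# hyper-parameter values follow the quoted implementation details.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR


def enumerate_anchors(L: int = 100):
    """All (t_s, t_e) anchor proposals with 0 < t_s < t_e < L, t_s, t_e in N."""
    return [(ts, te) for ts in range(1, L) for te in range(ts + 1, L)]


def build_optimizer(model: torch.nn.Module):
    """SGD with the quoted fine-tuning hyper-parameters (batch size is 16)."""
    optimizer = SGD(
        model.parameters(),
        lr=0.1,             # initial learning rate
        momentum=0.0,       # zero momentum, standard for fine-tuning
        weight_decay=1e-4,  # weight decay of 10^-4
    )
    # Decay the learning rate by 0.5 after every 5 epochs.
    scheduler = StepLR(optimizer, step_size=5, gamma=0.5)
    return optimizer, scheduler


if __name__ == "__main__":
    anchors = enumerate_anchors(L=100)           # anchors for 100 snippets
    model = torch.nn.Linear(2048, len(anchors))  # placeholder head, not G-TAD
    optimizer, scheduler = build_optimizer(model)
    print(len(anchors), scheduler.get_last_lr())
```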