Low-Fidelity Video Encoder Optimization for Temporal Action Localization

Authors: Mengmeng Xu, Juan-Manuel Pérez-Rúa, Xiatian Zhu, Bernard Ghanem, Brais Martinez

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed LoFi optimization approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18-based video encoder in a single RGB stream, our method surpasses two-stream (RGB + optical flow) ResNet50-based alternatives, often by a good margin. Our code is publicly available at https://github.com/saic-fi/lofi_action_localization.
Researcher Affiliation | Collaboration | Mengmeng Xu (1,2) mengmeng.xu@kaust.edu.sa; Juan-Manuel Pérez-Rúa (1) PerezRua.JM@gmail.com; Xiatian Zhu (1) xiatian.zhu@samsung.com; Bernard Ghanem (2) bernard.ghanem@kaust.edu.sa; Brais Martinez (1) brais.a@samsung.com. (1) Samsung AI Centre Cambridge, UK; (2) King Abdullah University of Science and Technology, Saudi Arabia.
Pseudocode | No | The paper describes the proposed method and training procedure in detailed textual descriptions and flow diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our code is publicly available at https://github.com/saic-fi/lofi_action_localization.
Open Datasets | Yes | Datasets: We use Kinetics400 [28] as the auxiliary video classification dataset for initial pretraining of the video encoder. For model performance evaluation, we use two popular temporal action localization benchmarks. (1) ActivityNet-v1.3 [22] contains 20K temporally annotated untrimmed videos with 200 action categories. (2) Human Action Clips and Segments (HACS-v1.1) [68] is a recent temporal action localization dataset.
Dataset Splits | Yes | In the standard evaluation protocol, these videos are divided into the training/validation/testing sets by the ratio of 2:1:1. (See the split sketch after the table.)
Hardware Specification | Yes | Hardware and software settings: We implemented our method using PyTorch 1.8 with CUDA 10.1. For LoFi training, we use 4 NVIDIA V100 GPUs, each with 32GB memory. Under this setting, the memory constraint is 128GB, which constitutes a mid-range computational budget. In the supplementary material, we further test a low-budget setting with a single V100 GPU in Table B.
Software Dependencies | Yes | Hardware and software settings: We implemented our method using PyTorch 1.8 with CUDA 10.1.
Experiment Setup | Yes | Implementation details: We use ResNet-based TSM [33] as the video encoder due to its good accuracy-cost trade-off and reasonable memory requirements compared to 3D-based alternatives. For the full-fidelity setting (Eq. (1)), we follow the standard G-TAD protocol and represent each video with L = 100 snippets. The full spatial resolution is H × W = 224 × 224. We keep the other hyper-parameters (e.g., the number of GCNeXt layers) the same as in the default G-TAD configuration. However, the number of anchor proposals can be reduced when L is smaller. Concretely, we enumerate all the possible combinations of start and end as the anchors, e.g., {(t_s, t_e) | 0 < t_s < t_e < L; t_s, t_e ∈ ℕ; t_e − t_s < L}. For LoFi training, we use an SGD optimizer. The batch size is 16 for all the training methods and input patterns. The weight decay is 10^-4 and we set the momentum to 0, which is standard for fine-tuning [32]. The learning rate is 0.1, and it is decayed by 0.5 after every 5 epochs. (See the training-setup sketch after the table.)
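The 2:1:1 protocol quoted in the Dataset Splits row can be illustrated with a short partitioning helper. This is a minimal sketch under assumed conventions: the function name split_2_1_1, the seeded shuffle, and the synthetic video IDs are hypothetical, and the benchmarks distribute their own official split files.

```python
# Minimal sketch of a 2:1:1 train/validation/test partition over video IDs.
# The helper name, the seeded shuffle, and the synthetic IDs are assumptions;
# ActivityNet-v1.3 and HACS-v1.1 ship their own official split files.
import random


def split_2_1_1(video_ids, seed=0):
    """Partition video IDs into train/val/test at a 2:1:1 ratio."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = n // 2, n // 4
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


if __name__ == "__main__":
    train, val, test = split_2_1_1([f"v_{i:05d}" for i in range(20000)])
    print(len(train), len(val), len(test))  # -> 10000 5000 5000
```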
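The anchor enumeration and optimizer settings quoted in the Experiment Setup row map fairly directly onto PyTorch. The sketch below is not the authors' released code: enumerate_anchors and build_optimizer are hypothetical helper names and the model stub is a placeholder; only the hyper-parameter values (L = 100 snippets, batch size 16, weight decay 10^-4, momentum 0, learning rate 0.1 halved every 5 epochs) and the anchor condition come from the quoted text.

```python
# Illustrative sketch of the quoted anchor enumeration and LoFi fine-tuning
# optimizer. Helper names and the model stub are assumptions; the
# hyper-parameter values follow the quoted implementation details.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR


def enumerate_anchors(L: int = 100):
    """All (t_s, t_e) anchor proposals with 0 < t_s < t_e < L, t_s, t_e in N."""
    return [(ts, te) for ts in range(1, L) for te in range(ts + 1, L)]


def build_optimizer(model: torch.nn.Module):
    """SGD with the quoted fine-tuning hyper-parameters (batch size is 16)."""
    optimizer = SGD(
        model.parameters(),
        lr=0.1,             # initial learning rate
        momentum=0.0,       # zero momentum, standard for fine-tuning
        weight_decay=1e-4,  # weight decay of 10^-4
    )
    # Decay the learning rate by 0.5 after every 5 epochs.
    scheduler = StepLR(optimizer, step_size=5, gamma=0.5)
    return optimizer, scheduler


if __name__ == "__main__":
    anchors = enumerate_anchors(L=100)           # anchors for 100 snippets
    model = torch.nn.Linear(2048, len(anchors))  # placeholder head, not G-TAD
    optimizer, scheduler = build_optimizer(model)
    print(len(anchors), scheduler.get_last_lr())
```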