MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Authors: Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority.
Researcher Affiliation | Collaboration | 1 Department of Computer Vision Technology (VIS), Baidu Inc.; 2 MIT-IBM Watson AI Lab
Pseudocode | No | The paper describes the architecture and module design in text and diagrams (Fig. 2), but it does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | Codes and models are available (footnote 2: https://github.com/whwu95/MVFNet).
Open Datasets | Yes | We evaluate our method on three large-scale video recognition benchmarks, including Kinetics-400 (K400) (Kay et al. 2017), Something-Something (Sth-Sth) V1&V2 (Goyal et al. 2017), and other two small-scale datasets, UCF-101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011).
Dataset Splits | Yes | Kinetics-400 contains 400 human action categories and provides around 240k training videos and 20k validation videos. ... For Something-Something V1 & V2 dataset, our model is trained for 50 epochs starting with a learning rate 0.01 and reducing it by a factor of 10 at 30, 40 and 45 epochs.
Hardware Specification | No | The paper states, "For all of our experiments, we utilize SGD with momentum 0.9 and weight decay of 1e-4 to train our models on 8 GPUs," but it does not specify the model or type of GPUs, CPUs, or any other hardware components.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | On the Kinetics-400 dataset, the learning rate is 0.01 and will be reduced by a factor of 10 at 90 and 130 epochs (150 epochs in total) respectively. For Something-Something V1 & V2 dataset, our model is trained for 50 epochs starting with a learning rate 0.01 and reducing it by a factor of 10 at 30, 40 and 45 epochs. ... For all of our experiments, we utilize SGD with momentum 0.9 and weight decay of 1e-4 to train our models on 8 GPUs. Each GPU processes a mini-batch of 8 video clips by default. ... we sample 4, 8 or 16 frames as a clip. The size of the short side of these frames is fixed to 256 and then random scaling is utilized for data augmentation. Finally, we resize the cropped regions to 224 × 224 for network training.
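
For concreteness, the quoted optimization setup can be approximated with the PyTorch-style sketch below. The dummy backbone, synthetic clips, and class count here are placeholder assumptions for illustration only; the actual MVFNet model and data pipeline are in the linked repository.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the MVFNet backbone (the real model is in the repo).
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(64, 400),  # Kinetics-400 has 400 action classes
)

# Quoted optimizer settings: SGD, momentum 0.9, weight decay 1e-4, base LR 0.01.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# Kinetics-400 schedule: LR divided by 10 at epochs 90 and 130 (150 epochs total).
# Something-Something V1/V2 would instead use milestones [30, 40, 45] over 50 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[90, 130], gamma=0.1)

# One illustrative training step: 8 clips per GPU, each clip of 8 frames,
# frames cropped/resized to 224x224, in (N, C, T, H, W) layout.
clips = torch.randn(8, 3, 8, 224, 224)
labels = torch.randint(0, 400, (8,))

loss = nn.functional.cross_entropy(model(clips), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # in a real run, stepped once per epoch after all batches
```

The short-side-256 resize and random-scaling augmentation quoted above would normally sit in the dataloader's transform pipeline rather than in this loop.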