MVFNet: Multi-View Fusion Network for Efficient Video Recognition
Authors: Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. |
| Researcher Affiliation | Collaboration | Department of Computer Vision Technology (VIS), Baidu Inc.; MIT-IBM Watson AI Lab |
| Pseudocode | No | The paper describes the architecture and module design in text and diagrams (Fig. 2), but it does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | Codes and models are available: https://github.com/whwu95/MVFNet |
| Open Datasets | Yes | We evaluate our method on three large-scale video recognition benchmarks, including Kinetics-400 (K400) (Kay et al. 2017), Something-Something (Sth-Sth) V1&V2 (Goyal et al. 2017), and other two small-scale datasets, UCF-101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011). |
| Dataset Splits | Yes | Kinetics-400 contains 400 human action categories and provides around 240k training videos and 20k validation videos. ... For Something-Something V1 & V2 dataset, our model is trained for 50 epochs starting with a learning rate 0.01 and reducing it by a factor of 10 at 30, 40 and 45 epochs. |
| Hardware Specification | No | The paper states, "For all of our experiments, we utilize SGD with momentum 0.9 and weight decay of 1e-4 to train our models on 8 GPUs," but it does not specify the model or type of GPUs, CPUs, or any other hardware components. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | On the Kinetics-400 dataset, the learning rate is 0.01 and will be reduced by a factor of 10 at 90 and 130 epochs (150 epochs in total) respectively. For Something-Something V1 & V2 dataset, our model is trained for 50 epochs starting with a learning rate 0.01 and reducing it by a factor of 10 at 30, 40 and 45 epochs. ... For all of our experiments, we utilize SGD with momentum 0.9 and weight decay of 1e-4 to train our models on 8 GPUs. Each GPU processes a mini-batch of 8 video clips by default. ... we sample 4, 8 or 16 frames as a clip. The size of the short side of these frames is fixed to 256 and then random scaling is utilized for data augmentation. Finally, we resize the cropped regions to 224×224 for network training. (Hedged sketches of this setup follow the table.) |
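As a rough illustration of the optimization schedule quoted in the Experiment Setup row, the PyTorch sketch below configures SGD with momentum 0.9, weight decay 1e-4, a base learning rate of 0.01, and a 10× step decay at the reported milestones. The model here is a placeholder, not the authors' released MVFNet implementation, and the training loop is elided.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder network; the real MVFNet model comes from the authors' repository.
model = torch.nn.Linear(2048, 400)  # stand-in classifier head (400 Kinetics classes)

# Optimizer settings as reported: SGD, momentum 0.9, weight decay 1e-4, base LR 0.01.
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Kinetics-400 schedule: decay LR by 10x at epochs 90 and 130, 150 epochs in total.
scheduler_k400 = MultiStepLR(optimizer, milestones=[90, 130], gamma=0.1)

# Something-Something V1/V2 schedule: 50 epochs, decay by 10x at epochs 30, 40 and 45.
# scheduler_sth = MultiStepLR(optimizer, milestones=[30, 40, 45], gamma=0.1)

for epoch in range(150):
    # ... one pass over the training set with 8 clips per GPU on 8 GPUs ...
    scheduler_k400.step()
```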
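The frame preprocessing described in the same row (short side fixed to 256, random scaling for augmentation, final 224×224 crop) could be approximated with torchvision transforms as below. The use of RandomResizedCrop as the "random scaling" step is an assumption for illustration, not the authors' exact pipeline.

```python
import torch
from torchvision import transforms

# Approximate per-frame augmentation for training clips (4, 8 or 16 sampled frames per clip).
# The paper fixes the short side to 256, applies random scaling, then crops to 224x224;
# RandomResizedCrop is used here as an assumed stand-in for that random-scaling step.
train_transform = transforms.Compose([
    transforms.Resize(256),             # short side -> 256
    transforms.RandomResizedCrop(224),  # assumed stand-in for "random scaling" + crop
    transforms.ToTensor(),
])

# Example usage: apply the transform to each frame of a sampled clip (PIL images assumed).
# clip = [train_transform(frame) for frame in sampled_frames]   # list of 3x224x224 tensors
# clip_tensor = torch.stack(clip)                               # shape: (T, 3, 224, 224)
```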