EAC-Net: Efficient and Accurate Convolutional Network for Video Recognition
Authors: Bowei Jin, Zhuo Xu (pp. 11149-11156)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on Kinetics, our EAC-Nets achieved better results than TSM models with fewer FLOPs. With the same 2D backbones, EAC-Nets outperformed Non-Local I3D counterparts by achieving higher accuracy with about 7× fewer FLOPs. |
| Researcher Affiliation | Industry | Bowei Jin, Zhuo Xu, iFLYTEK Research, Suzhou, China, {bwjin, zhuoxu}@iflytek.com |
| Pseudocode | No | The paper describes the architecture and formulations of its proposed blocks (MGTE and ATE) using diagrams and mathematical equations, but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We perform comprehensive studies on the challenging Kinetics dataset (Kay et al. 2017). This dataset contains 246k training videos and 20k validation videos. It is a classification task involving 400 human action categories. We train all models on the training set and test on the validation set. The other dataset reported is Something-Something V1 (Goyal et al. 2017), which consists of 110k videos of 174 different low-level actions. In all experiments, our models are initialized with ImageNet (Russakovsky et al. 2015) pre-trained models. |
| Dataset Splits | Yes | This dataset contains 246k training videos and 20k validation videos. It is a classification task involving 400 human action categories. We train all models on the training set and test on the validation set. During training, we first sample 32 frames at a random rate from a video, and resize the shorter side of each sampled frame to a number picked randomly from 215 to 345. Then 224×224 random cropping is applied to these processed frames, yielding a network input of dimensions 32×3×224×224. For Kinetics, we train for up to 60 epochs, starting with a learning rate of 0.001 and a 10× reduction of the learning rate at epochs 30 and 50. (A code sketch of this preprocessing appears below the table.) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper describes the model architecture and training procedures but does not specify any software dependencies or their version numbers (e.g., deep learning frameworks like PyTorch/TensorFlow, or Python version). |
| Experiment Setup | Yes | Implementation Details: During training, we first sample 32 frames at a random rate from a video, and resize the shorter side of each sampled frame to a number picked randomly from 215 to 345. Then 224×224 random cropping is applied to these processed frames, yielding a network input of dimensions 32×3×224×224. In all experiments, our models are initialized with ImageNet (Russakovsky et al. 2015) pre-trained models. For Kinetics, we train for up to 60 epochs, starting with a learning rate of 0.001 and a 10× reduction of the learning rate at epochs 30 and 50. We use a momentum of 0.9 and a weight decay of 5e-4. We then fine-tune models pre-trained on Kinetics on the Something-Something V1 dataset, where fine-tuning is conducted for 25 total epochs, starting with an initial learning rate of 0.001, reduced by a factor of 0.1 at epochs 10, 15, and 20. (A code sketch of this optimizer schedule appears below the table.) |
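
The paper names no framework, so the following is a minimal PyTorch-style sketch of the training-time preprocessing quoted above: sample 32 frames at a random rate, resize the shorter side to a random value in [215, 345], then take a random 224×224 crop. The function name, tensor layout, and the padding fallback for shorter sides below 224 are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def preprocess_clip(video: torch.Tensor) -> torch.Tensor:
    """Sketch of the quoted training preprocessing.

    Assumes `video` is a (T, 3, H, W) float tensor with T >= 32;
    all names and the tensor layout are illustrative.
    """
    num_frames, crop_size = 32, 224
    assert video.shape[0] >= num_frames, "sampling assumes >= 32 frames"

    # Sample 32 frames at a random stride ("random rate" in the paper).
    max_stride = video.shape[0] // num_frames
    stride = int(torch.randint(1, max_stride + 1, (1,)))
    start = int(torch.randint(0, video.shape[0] - (num_frames - 1) * stride, (1,)))
    clip = video[start : start + num_frames * stride : stride].float()

    # Resize so the shorter side becomes a random value in [215, 345].
    short_side = int(torch.randint(215, 346, (1,)))
    scale = short_side / min(clip.shape[2], clip.shape[3])
    clip = F.interpolate(clip, scale_factor=scale, mode="bilinear",
                         align_corners=False)

    # Assumption: pad when the resized shorter side falls below 224,
    # since the quoted range starts at 215 and the paper does not say
    # how that case is handled.
    pad_h = max(0, crop_size - clip.shape[2])
    pad_w = max(0, crop_size - clip.shape[3])
    if pad_h or pad_w:
        clip = F.pad(clip, (0, pad_w, 0, pad_h))

    # Random 224x224 crop, shared across all 32 frames.
    top = int(torch.randint(0, clip.shape[2] - crop_size + 1, (1,)))
    left = int(torch.randint(0, clip.shape[3] - crop_size + 1, (1,)))
    return clip[:, :, top : top + crop_size, left : left + crop_size]  # 32x3x224x224
```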
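
The momentum (0.9) and weight-decay (5e-4) values quoted above suggest SGD, though the paper names neither the optimizer nor the framework. A minimal sketch of the Kinetics learning-rate schedule under that assumption:

```python
import torch

# Placeholder network; stands in for an EAC-Net built on a 2D backbone.
model = torch.nn.Conv3d(3, 64, kernel_size=3)

# lr=0.001, momentum=0.9, weight decay=5e-4 are from the paper;
# SGD itself is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)

# Kinetics schedule: up to 60 epochs, LR divided by 10 at epochs 30 and 50.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 50], gamma=0.1)

for epoch in range(60):
    # ... one training epoch over Kinetics (omitted) ...
    scheduler.step()

# Fine-tuning on Something-Something V1 would instead run 25 epochs with
# milestones=[10, 15, 20], again starting from lr=0.001 with gamma=0.1.
```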