Temporal Context Enhanced Feature Aggregation for Video Object Detection
Authors: Fei He, Naiyu Gao, Qiaozhe Li, Senyao Du, Xin Zhao, Kaiqi Huang
AAAI 2020, pp. 10941-10948 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our TCENet achieves state-of-the-art performance on the ImageNet VID dataset and has a faster runtime. Without bells and whistles, our TCENet achieves 80.3% mAP by only aggregating 3 frames. Following most of the previous video object detection works, we evaluate our method on ImageNet VID (Deng et al. 2009). |
| Researcher Affiliation | Collaboration | Fei He (1,2), Naiyu Gao (1,2), Qiaozhe Li (1,2), Senyao Du (3), Xin Zhao (1,2), Kaiqi Huang (1,2,4); 1: CRISE, Institute of Automation, Chinese Academy of Sciences; 2: University of Chinese Academy of Sciences; 3: Horizon Robotics; 4: CAS Center for Excellence in Brain Science and Intelligence Technology |
| Pseudocode | Yes | Algorithm 1 Inference algorithm of temporal context enhanced feature aggregation for video object detection. A minimal sketch of such an inference loop is given after the table. |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the described methodology. |
| Open Datasets | Yes | Dataset. Following most of the previous video object detection works, we evaluate our method on ImageNet VID (Deng et al. 2009). The VID dataset contains 3862 training videos and 555 validation videos. |
| Dataset Splits | Yes | The VID dataset contains 3862 training videos and 555 validation videos. All videos are fully annotated with object bounding boxes, object categories, and tracking IDs. There are 30 object categories, a subset of the categories in the ImageNet DET dataset. Mean average precision (mAP) is used as the evaluation metric, and all results on the validation set are reported following previous methods (Zhu et al. 2017a; Lee et al. 2016). Implementation Details: During training, following previous works, both the ImageNet DET training set and the ImageNet VID training set are utilized. Two-phase training is performed. |
| Hardware Specification | Yes | All the above results are tested on an NVIDIA Titan X Pascal GPU. |
| Software Dependencies | No | The paper mentions various models and architectures (e.g., ResNet-101, R-FCN, DCN), but does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | Implementation Details: During training, following previous works, both the ImageNet DET training set and the ImageNet VID training set are utilized. Two-phase training is performed. In the first phase, the detection networks, the DeformAlign module, and TCEA are trained on ImageNet DET and ImageNet VID; only the same 30 categories are used. Each training batch contains three images. If they are sampled from DET, all images within the same mini-batch will be identical because DET contains only still images. If they are sampled from VID, two supporting frames are randomly sampled near the reference frame in the range of [-9, 9]. In the second phase, the whole network except the temporal stride predictor is fixed. Then the predictor is trained based on the feature network with ImageNet VID. Each training batch has a pair of images, and the time step between them is randomly taken in [5, 15]. In both training and inference, the images are resized to a shorter side of 600 pixels for the feature network. A minimal sketch of this sampling and resizing scheme is also given after the table. |
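
To make the inference procedure referenced in the Pseudocode row (Algorithm 1) concrete, below is a minimal, illustrative sketch of a loop that aligns supporting-frame features to the reference frame and fuses them with similarity-based weights before detection. The function names (`extract_features`, `align_to_reference`, `aggregate`, `detection_head`) and the cosine-similarity weighting are assumptions made for this sketch, not the authors' released implementation (no open-source code is provided).

```python
# Illustrative sketch of inference-time temporal feature aggregation:
# features from supporting frames are aligned to the reference frame and
# fused with adaptive weights before running the detection head.
# All names and the weighting scheme below are assumptions for illustration.
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the backbone (e.g. ResNet-101): returns a C x H x W feature map."""
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2 ** 32))
    return rng.standard_normal((256, 38, 50))

def align_to_reference(support_feat: np.ndarray, ref_feat: np.ndarray) -> np.ndarray:
    """Stand-in for the DeformAlign step: warp the supporting feature toward the
    reference frame. Here it simply passes the feature through unchanged."""
    return support_feat

def aggregate(ref_feat: np.ndarray, support_feats: list) -> np.ndarray:
    """Cosine-similarity weighted fusion of aligned features (a common choice in
    feature-aggregation detectors; the exact weighting in the paper may differ)."""
    feats = [ref_feat] + support_feats
    ref_vec = ref_feat.ravel()
    weights = []
    for f in feats:
        vec = f.ravel()
        cos = ref_vec.dot(vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(vec) + 1e-8)
        weights.append(cos)
    weights = np.exp(weights) / np.sum(np.exp(weights))  # softmax over frames
    return sum(w * f for w, f in zip(weights, feats))

def detection_head(feat: np.ndarray) -> list:
    """Stand-in for the R-FCN detection head: returns dummy (box, score, class) tuples."""
    return [((10, 10, 50, 50), 0.9, "car")]

def detect_reference_frame(reference: np.ndarray, supports: list) -> list:
    ref_feat = extract_features(reference)
    aligned = [align_to_reference(extract_features(s), ref_feat) for s in supports]
    fused = aggregate(ref_feat, aligned)
    return detection_head(fused)

if __name__ == "__main__":
    frames = [np.full((600, 800, 3), i, dtype=np.uint8) for i in range(3)]
    print(detect_reference_frame(frames[1], [frames[0], frames[2]]))
```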
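
The training details quoted in the Experiment Setup row (two supporting frames sampled within [-9, 9] of the reference frame, second-phase pairs with a time step drawn from [5, 15], and images resized so the shorter side is 600 pixels) translate into the following sampling and resizing sketch. The function names and the boundary clamping are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative sketch of the training-time frame sampling and image resizing
# described in the Experiment Setup row. Names and edge handling are assumptions.
import random

def sample_phase1_indices(num_frames: int, ref_idx: int) -> list:
    """Phase 1: pick two supporting frame indices within [-9, 9] of the reference frame."""
    supports = []
    while len(supports) < 2:
        offset = random.randint(-9, 9)
        idx = min(max(ref_idx + offset, 0), num_frames - 1)
        supports.append(idx)
    return supports

def sample_phase2_pair(num_frames: int) -> tuple:
    """Phase 2: pick a frame pair whose time step is uniform in [5, 15]
    (used to train the temporal stride predictor)."""
    step = random.randint(5, 15)
    first = random.randint(0, max(num_frames - 1 - step, 0))
    return first, min(first + step, num_frames - 1)

def resize_shorter_side(width: int, height: int, target: int = 600) -> tuple:
    """Scale an image so its shorter side becomes `target` pixels, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

if __name__ == "__main__":
    print(sample_phase1_indices(num_frames=120, ref_idx=60))
    print(sample_phase2_pair(num_frames=120))
    print(resize_shorter_side(1280, 720))  # -> (1067, 600)
```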