Temporal Context Enhanced Feature Aggregation for Video Object Detection
Authors: Fei He, Naiyu Gao, Qiaozhe Li, Senyao Du, Xin Zhao, Kaiqi Huang
AAAI 2020, pp. 10941-10948 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our TCENet achieves state-of-the-art performance on the ImageNet VID dataset and has a faster runtime. Without bells and whistles, our TCENet achieves 80.3% mAP by only aggregating 3 frames. Following most of the previous video object detection works, we evaluate our method on ImageNet VID (Deng et al. 2009). |
| Researcher Affiliation | Collaboration | Fei He (1,2), Naiyu Gao (1,2), Qiaozhe Li (1,2), Senyao Du (3), Xin Zhao (1,2), Kaiqi Huang (1,2,4); 1: CRISE, Institute of Automation, Chinese Academy of Sciences; 2: University of Chinese Academy of Sciences; 3: Horizon Robotics; 4: CAS Center for Excellence in Brain Science and Intelligence Technology |
| Pseudocode | Yes | Algorithm 1 Inference algorithm of temporal context enhanced feature aggregation for video object detection. A minimal sketch of such an inference loop is given after the table. |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the described methodology. |
| Open Datasets | Yes | Dataset. Following most of the previous video object detection works, we evaluate our method on ImageNet VID (Deng et al. 2009). The VID dataset contains 3862 training videos and 555 validation videos. |
| Dataset Splits | Yes | The VID dataset contains 3862 training videos and 555 validation videos. All videos are fully annotated with object bounding boxes, object categories, and tracking IDs. There are 30 object categories, a subset of the categories in the ImageNet DET dataset. Mean average precision (mAP) is used as the evaluation metric, and all results on the validation set are reported following previous methods (Zhu et al. 2017a; Lee et al. 2016). Implementation Details: During training, following previous works, both the ImageNet DET training set and the ImageNet VID training set are utilized. Two-phase training is performed. |
| Hardware Specification | Yes | All the above results are tested on an NVIDIA Titan X Pascal GPU. |
| Software Dependencies | No | The paper mentions various models and architectures (e.g., ResNet-101, R-FCN, DCN), but does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | Implementation Details: During training, following previous works, both the ImageNet DET training set and the ImageNet VID training set are utilized. Two-phase training is performed. In the first phase, the detection networks, the DeformAlign module, and TCEA are trained on ImageNet DET and ImageNet VID; only the same 30 categories are used. Each training batch contains three images. If they are sampled from DET, all images within the same mini-batch will be identical because DET contains only still images. If they are sampled from VID, two supporting frames are randomly sampled near the reference frame in the range of [-9, 9]. In the second phase, the whole network except the temporal stride predictor is fixed. Then the predictor is trained based on the feature network with ImageNet VID. Each training batch has a pair of images, and the time step between them is randomly taken in [5, 15]. In both training and inference, the images are resized to a shorter side of 600 pixels for the feature network. A minimal sketch of this sampling and resizing scheme is also given after the table. |
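
To make the inference procedure referenced in the Pseudocode row (Algorithm 1) concrete, below is a minimal, illustrative sketch of a loop that aligns supporting-frame features to the reference frame and fuses them with similarity-based weights before detection. The function names (`extract_features`, `align_to_reference`, `aggregate`, `detection_head`) and the cosine-similarity weighting are assumptions made for this sketch, not the authors' released implementation (no open-source code is provided).

```python
# Illustrative sketch of inference-time temporal feature aggregation:
# features from supporting frames are aligned to the reference frame and
# fused with adaptive weights before running the detection head.
# All names and the weighting scheme below are assumptions for illustration.
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the backbone (e.g. ResNet-101): returns a C x H x W feature map."""
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2 ** 32))
    return rng.standard_normal((256, 38, 50))

def align_to_reference(support_feat: np.ndarray, ref_feat: np.ndarray) -> np.ndarray:
    """Stand-in for the DeformAlign step: warp the supporting feature toward the
    reference frame. Here it simply passes the feature through unchanged."""
    return support_feat

def aggregate(ref_feat: np.ndarray, support_feats: list) -> np.ndarray:
    """Cosine-similarity weighted fusion of aligned features (a common choice in
    feature-aggregation detectors; the exact weighting in the paper may differ)."""
    feats = [ref_feat] + support_feats
    ref_vec = ref_feat.ravel()
    weights = []
    for f in feats:
        vec = f.ravel()
        cos = ref_vec.dot(vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(vec) + 1e-8)
        weights.append(cos)
    weights = np.exp(weights) / np.sum(np.exp(weights))  # softmax over frames
    return sum(w * f for w, f in zip(weights, feats))

def detection_head(feat: np.ndarray) -> list:
    """Stand-in for the R-FCN detection head: returns dummy (box, score, class) tuples."""
    return [((10, 10, 50, 50), 0.9, "car")]

def detect_reference_frame(reference: np.ndarray, supports: list) -> list:
    ref_feat = extract_features(reference)
    aligned = [align_to_reference(extract_features(s), ref_feat) for s in supports]
    fused = aggregate(ref_feat, aligned)
    return detection_head(fused)

if __name__ == "__main__":
    frames = [np.full((600, 800, 3), i, dtype=np.uint8) for i in range(3)]
    print(detect_reference_frame(frames[1], [frames[0], frames[2]]))
```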
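
The training details quoted in the Experiment Setup row (two supporting frames sampled within [-9, 9] of the reference frame, second-phase pairs with a time step drawn from [5, 15], and images resized so the shorter side is 600 pixels) translate into the following sampling and resizing sketch. The function names and the boundary clamping are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative sketch of the training-time frame sampling and image resizing
# described in the Experiment Setup row. Names and edge handling are assumptions.
import random

def sample_phase1_indices(num_frames: int, ref_idx: int) -> list:
    """Phase 1: pick two supporting frame indices within [-9, 9] of the reference frame."""
    supports = []
    while len(supports) < 2:
        offset = random.randint(-9, 9)
        idx = min(max(ref_idx + offset, 0), num_frames - 1)
        supports.append(idx)
    return supports

def sample_phase2_pair(num_frames: int) -> tuple:
    """Phase 2: pick a frame pair whose time step is uniform in [5, 15]
    (used to train the temporal stride predictor)."""
    step = random.randint(5, 15)
    first = random.randint(0, max(num_frames - 1 - step, 0))
    return first, min(first + step, num_frames - 1)

def resize_shorter_side(width: int, height: int, target: int = 600) -> tuple:
    """Scale an image so its shorter side becomes `target` pixels, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

if __name__ == "__main__":
    print(sample_phase1_indices(num_frames=120, ref_idx=60))
    print(sample_phase2_pair(num_frames=120))
    print(resize_shorter_side(1280, 720))  # -> (1067, 600)
```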