Temporal ROI Align for Video Object Recognition
Authors: Tao Gong, Kai Chen, Xinjiang Wang, Qi Chu, Feng Zhu, Dahua Lin, Nenghai Yu, Huamin Feng
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that the proposed Temporal ROI Align operator can consistently and significantly boost the performance. |
| Researcher Affiliation | Collaboration | Tao Gong 1,2, Kai Chen 3, Xinjiang Wang 3, Qi Chu 1,2, Feng Zhu 3, Dahua Lin 4, Nenghai Yu 1,2, Huamin Feng 5. 1 School of Cyberspace Security, University of Science and Technology of China; 2 Key Laboratory of Electromagnetic Space Information, Chinese Academy of Sciences; 3 SenseTime Research; 4 The Chinese University of Hong Kong; 5 Beijing Electronic Science and Technology Institute |
| Pseudocode | No | The paper describes the methods using text, mathematical formulas, and diagrams, but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | Experiments are carried out on the ImageNet VID dataset (Russakovsky et al. 2015), which contains 30 object categories for video object detection. Temporal ROI Align is also evaluated on the EPIC-KITCHENS dataset (Damen et al. 2018). The dataset used for video instance segmentation (VIS) is YouTube-VIS (Yang, Fan, and Xu 2019), which contains 40 object categories. |
| Dataset Splits | Yes | Experiments are carried out on the ImageNet VID dataset (Russakovsky et al. 2015), which contains 30 object categories for video object detection. There are a total of 3,862 video snippets in the training set and 555 video snippets in the validation set. The models are trained with a mixture of the ImageNet VID and ImageNet DET datasets (Russakovsky et al. 2015), using the 30 VID classes and the overlapping 30 of the 200 DET classes. For fair comparison, the split provided in FGFA (Zhu et al. 2017a) is used. At most 15 frames are subsampled from each video, and the DET:VID balance is approximately 1:1. For training, one key frame is sampled along with two random frames from the same video (see the sampling sketch after this table). |
| Hardware Specification | No | A total of 6 epochs of SGD training is performed with a total batch size of 16 on 16 GPUs. |
| Software Dependencies | No | The paper mentions using ResNet-101 and ResNeXt-101 backbones, SGD optimizer, and RPN, but does not provide specific version numbers for any software libraries or programming languages used for implementation. |
| Experiment Setup | Yes | The stride of the first conv block in the conv5 stage is modified from 2 to 1 in order to enlarge the resolution of the feature maps; the effective stride of that stage is thus changed from 32 pixels to 16 pixels. All 3×3 conv layers in the stage are replaced with dilated convolutions to compensate for the reduced receptive fields. There are a total of 12 anchors, with 4 scales {64², 128², 256², 512²} and 3 aspect ratios {1:2, 1:1, 2:1}, and 300 proposals are produced per image. The Temporal ROI Align is applied on the output of conv5, and the spatial size h × w of the ROI features is set to 7 × 7. For Most Similar (MS) ROI Align, a total of 49 similarity maps are calculated for each proposal, and the top K = 4 similarity scores and their spatial locations are selected from each similarity map (see the sketch after this table). For Temporal Attentional Feature Aggregation (TAFA), N = 4 temporal attention blocks are used to aggregate the ROI features with the most similar ROI features, and each ψ_n(·) is a 3×3 convolution layer. Two 1024-d fully connected layers are applied on top of the temporal ROI features, followed by classification and bounding-box regression. A total of 6 epochs of SGD training is performed with a total batch size of 16 on 16 GPUs. The initial learning rate is 0.02 and is divided by 10 at the 4th and 6th epochs. The models are trained with a mixture of the ImageNet VID and ImageNet DET datasets (Russakovsky et al. 2015), using the 30 VID classes and the overlapping 30 of the 200 DET classes. In both training and inference, the images are resized to a shorter side of 600 pixels. |
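
For reference, here is a minimal sketch of the training-time frame sampling described in the Dataset Splits row. It assumes uniform random selection of support frames and a simple coin flip for the roughly 1:1 DET:VID balance; the function names and exact sampling policy are illustrative, not taken from the paper's code.

```python
import random

def subsample_video(frames, max_frames=15):
    """Keep at most 15 evenly spaced frames per VID video, as in the FGFA split."""
    if len(frames) <= max_frames:
        return frames
    step = len(frames) / max_frames
    return [frames[int(i * step)] for i in range(max_frames)]

def sample_training_example(vid_videos, det_images, num_support=2):
    """Draw either a DET still image or a VID key frame plus two random support
    frames from the same video, with a roughly 1:1 DET:VID balance."""
    if random.random() < 0.5:
        image = random.choice(det_images)
        # A still image has no temporal context, so it serves as its own support frames.
        return image, [image] * num_support
    frames = subsample_video(random.choice(vid_videos))
    key_frame = random.choice(frames)
    support = [random.choice(frames) for _ in range(num_support)]
    return key_frame, support
```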
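
The Most Similar ROI Align step from the Experiment Setup row can be pictured with the following PyTorch-style sketch. It assumes dot-product similarity between L2-normalized features and softmax-weighted gathering of the top-K support locations; the exact similarity measure and weighting in the paper may differ, so this is an illustration of the operator, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def most_similar_roi_align(roi_feat, support_feat, top_k=4):
    """Gather a "most similar" ROI feature from one support frame.

    roi_feat:     (C, 7, 7) ROI feature of a proposal on the key frame.
    support_feat: (C, H, W) backbone feature map of a support frame.
    Returns a (C, 7, 7) feature whose 49 bins are aggregated from the
    top-K most similar locations of the support frame.
    """
    c, h, w = roi_feat.shape
    _, H, W = support_feat.shape
    q = F.normalize(roi_feat.reshape(c, h * w), dim=0)       # (C, 49) query bins
    k = F.normalize(support_feat.reshape(c, H * W), dim=0)   # (C, H*W) support locations
    sim = q.t() @ k                                           # (49, H*W): one similarity map per bin
    scores, idx = sim.topk(top_k, dim=1)                      # top-K locations per bin
    weights = scores.softmax(dim=1)                           # (49, K) aggregation weights
    gathered = support_feat.reshape(c, H * W)[:, idx]         # (C, 49, K) gathered support features
    out = (gathered * weights.unsqueeze(0)).sum(dim=-1)       # (C, 49)
    return out.reshape(c, h, w)
```

In the paper, the most similar ROI features obtained this way from all support frames are then fused with the key-frame ROI feature by the N = 4 TAFA blocks before the two 1024-d fully connected heads.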