MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

Authors: Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive evaluations on the challenging ImageNet VID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy." and, under Experiments, Experimental Settings, Dataset and Evaluation: "We evaluate our method on the ImageNet (Russakovsky et al. 2015) VID dataset which contains 3,862 training and 555 validation videos."
Researcher Affiliation | Collaboration | Guanxiong Sun (1,2), Yang Hua (1), Guosheng Hu (2,1), Neil Robertson (1); 1: EEECS/ECIT, Queen's University Belfast, UK; 2: Anyvision, Belfast, UK; {gsun02, y.hua, n.robertson}@qub.ac.uk, huguosheng100@gmail.com
Pseudocode | Yes | "Algorithm 1: Inference Algorithm with Memory Bank in a PyTorch-like style." A hedged sketch of such a memory-bank inference loop is given after the table.
Open Source Code | No | No statement about open-source code release or a link to a repository.
Open Datasets | Yes | "We evaluate our method on the ImageNet (Russakovsky et al. 2015) VID dataset which contains 3,862 training and 555 validation videos."
Dataset Splits | Yes | "We evaluate our method on the ImageNet (Russakovsky et al. 2015) VID dataset which contains 3,862 training and 555 validation videos." and "We follow the previous approaches... and train our model on the overlapped 30 classes of ImageNet VID and DET set. Specifically, we sample 15 frames from each video in VID dataset and at most 2,000 images per class from DET dataset as our training set. Then we report the mean average precision (mAP) on the validation set." This sampling rule is illustrated in the second sketch after the table.
Hardware Specification | Yes | "The whole architecture is trained on 4 Titan RTX GPUs with SGD (momentum: 0.9, weight decay: 0.0001)." and "All results are obtained on Titan RTX GPUs."
Software Dependencies | No | "Algorithm 1: Inference Algorithm with Memory Bank in a PyTorch-like style." No version numbers for PyTorch or other libraries are given.
Experiment Setup | Yes | "The whole architecture is trained on 4 Titan RTX GPUs with SGD (momentum: 0.9, weight decay: 0.0001). In the first phase, we only train the pixel-level enhancement. Each GPU contains one minibatch consisting of two frames, the key frame Ik and a randomly selected frame from the video to approximately form pixel-level memory. Both RPN losses and Detection losses are only computed on the key frame. We train the pixel-level model for 60K iterations. The learning rate is 0.001 for the first 40K iterations, and 0.0001 for the last 20K iterations. In the second phase, we end-to-end train both pixel-level enhancement and instance-level enhancement for 120K iterations. The learning rate is 0.001 for the first 80K iterations and 0.0001 for the last 40K iterations. We use 12 anchors with 4 scales (64², 128², 256², 512²) and 3 aspect ratios (1:2, 1:1, 2:1). Non-maximum suppression (NMS) is applied to generate 300 proposals for each image with an IoU threshold of 0.7. Finally, NMS is applied to clean the detection results, with an IoU threshold of 0.5." These settings are mapped onto standard PyTorch APIs in the third sketch after the table.
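
The paper's Algorithm 1 presents its memory-bank inference in a PyTorch-like style. The following is a minimal sketch of what such a loop might look like; the helpers backbone, enhance, and detect, as well as bank_size and sample_size, are hypothetical placeholders rather than the authors' implementation.

```python
import random

def video_inference(frames, backbone, enhance, detect,
                    bank_size=1000, sample_size=100):
    """Detect objects frame by frame, aggregating features from a memory bank.

    backbone, enhance, and detect are hypothetical callables standing in for
    the feature extractor, the aggregation module, and the detection head.
    """
    memory_bank = []  # features collected from previously processed frames
    results = []
    for frame in frames:
        feat = backbone(frame)  # per-frame feature extraction
        if memory_bank:
            # Randomly sample a subset of stored features to aggregate with.
            sampled = random.sample(memory_bank,
                                    min(sample_size, len(memory_bank)))
            feat = enhance(feat, sampled)
        results.append(detect(feat))  # run the detection head
        memory_bank.append(feat)      # write current features back to the bank
        if len(memory_bank) > bank_size:
            # Keep the bank bounded by evicting a randomly chosen entry
            # (an assumption here; the paper defines its own update strategy).
            memory_bank.pop(random.randrange(len(memory_bank)))
    return results
```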
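The quoted dataset-split rule (15 frames per VID video, at most 2,000 DET images per class) is simple enough to express directly. The sketch below assumes hypothetical in-memory lists of frame and image paths; it illustrates the sampling rule, not the authors' data pipeline.

```python
import random

def build_training_set(vid_videos, det_images_by_class,
                       frames_per_video=15, max_det_per_class=2000):
    """Assemble a training set following the paper's stated sampling rule.

    vid_videos: list of videos, each a list of frame paths (hypothetical).
    det_images_by_class: dict mapping class name to image paths (hypothetical).
    """
    samples = []
    for frames in vid_videos:
        # Sample 15 frames per video, or all frames if the video is shorter.
        samples.extend(random.sample(frames, min(frames_per_video, len(frames))))
    for images in det_images_by_class.values():
        # Cap DET images at 2,000 per class.
        samples.extend(random.sample(images, min(max_det_per_class, len(images))))
    return samples
```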
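The Experiment Setup row quotes concrete optimizer, learning-rate, anchor, and NMS values, which map onto standard PyTorch APIs as shown below. The model is a placeholder module, and stepping the scheduler once per iteration is an assumption on my part.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual detector

# SGD with momentum 0.9 and weight decay 0.0001, as quoted above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)

# Phase 1 (pixel-level only): 60K iterations, LR 0.001 -> 0.0001 after 40K.
phase1_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40_000], gamma=0.1)
# Phase 2 (end-to-end): 120K iterations, LR 0.001 -> 0.0001 after 80K.
phase2_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80_000], gamma=0.1)

# 12 anchors per location: 4 scales x 3 aspect ratios.
anchor_areas = [64**2, 128**2, 256**2, 512**2]
aspect_ratios = [0.5, 1.0, 2.0]  # 1:2, 1:1, 2:1

RPN_NMS_IOU = 0.7    # NMS threshold used when keeping 300 proposals per image
FINAL_NMS_IOU = 0.5  # NMS threshold used to clean the final detections
```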