YOLOV: Making Still Image Object Detectors Great at Video Object Detection

Authors: Yuheng Shi, Naiyan Wang, Xiaojie Guo

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments and ablation studies to verify the efficacy of our design, and reveal its superiority over other state-of-the-art VID approaches in both effectiveness and efficiency.
Researcher Affiliation | Collaboration | Yuheng Shi (1), Naiyan Wang (2), Xiaojie Guo (1, *); 1 Tianjin University, 2 TuSimple; yuheng@tju.edu.cn, winsty@gmail.com, xj.max.guo@gmail.com
Pseudocode | No | The paper describes the methodology using text and mathematical equations, but does not include a structured pseudocode or algorithm block.
Open Source Code | Yes | The implementation is simple, we have made the demo codes and models available at https://github.com/YuHengsss/YOLOV.
Open Datasets | Yes | Specifically, the ImageNet VID (Russakovsky et al. 2015) contains 3,862 videos for training and 555 videos for validation. There are 30 categories in the VID dataset, i.e., a subset of the 200 basic-level categories of ImageNet DET (Russakovsky et al. 2015).
Dataset Splits | Yes | Specifically, the ImageNet VID (Russakovsky et al. 2015) contains 3,862 videos for training and 555 videos for validation. ... In the training phase, the images are randomly resized from 352 × 352 to 672 × 672 with 32 steps. In the testing phase, the images are uniformly resized to 576 × 576. (A sketch of this resizing policy appears after the table.)
Hardware Specification | Yes | Our YOLOX-based model can achieve promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU) ... we test all of the models with FP16-precision on a 2080Ti GPU unless otherwise stated.
Software Dependencies | No | The paper mentions YOLOX, SGD, and FP16-precision, but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | The base detectors are trained for 7 epochs by SGD with batch size of 16 on 2 GPUs. As for the learning rate, we adopt the cosine learning rate schedule used in YOLOX with one epoch for warming up and disable the strong data augmentation for the last 2 epochs. When integrating the feature aggregation module into the base detectors, we fine-tune them for 150K iterations with batch size of 16 on a single 2080Ti GPU. In addition, we use warm-up for the first 15K iterations and cosine learning rate schedule for the rest iterations. For the training of feature aggregation module, the number of frames f is set to 16, and the threshold of NMS is set to 0.75 for rough feature selection. While for producing final detection boxes, we alternatively set the threshold of NMS to 0.5 for retaining more confident candidates. In the training phase, the images are randomly resized from 352 × 352 to 672 × 672 with 32 steps. In the testing phase, the images are uniformly resized to 576 × 576. (Sketches of the learning-rate schedule and the NMS settings follow the table.)
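
The fine-tuning schedule quoted above (warm-up for the first 15K iterations, cosine decay over the remaining 135K of 150K total) can be written down directly. Below is a minimal Python sketch; the base learning rate and the linear form of the warm-up ramp are assumptions, since the excerpt states only the iteration counts, and YOLOX's own scheduler may use a different ramp.

    import math

    def lr_at_iter(it, total_iters=150_000, warmup_iters=15_000,
                   base_lr=0.01, min_lr=0.0):
        # Linear warm-up over the first 15K iterations (warm-up form and
        # base_lr are illustrative assumptions, not stated in the excerpt).
        if it < warmup_iters:
            return base_lr * it / warmup_iters
        # Cosine decay over the remaining 135K iterations.
        progress = (it - warmup_iters) / (total_iters - warmup_iters)
        return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))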
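
The multi-scale resizing policy can be made concrete as well. The sketch below reads "32 steps" as a stride of 32 pixels between candidate sizes, the usual YOLOX convention, which yields 11 square training resolutions from 352 to 672; that reading is an assumption.

    import random

    def sample_train_size(lo=352, hi=672, stride=32):
        # 11 candidate square resolutions: 352, 384, ..., 672.
        size = lo + stride * random.randint(0, (hi - lo) // stride)
        return size, size

    TEST_SIZE = (576, 576)  # images are uniformly resized at test time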
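
The two NMS thresholds serve different purposes: the looser 0.75 IoU threshold suppresses fewer overlapping predictions, keeping a larger "rough" candidate set for feature aggregation during training, while the stricter 0.5 threshold prunes more aggressively so that only confident, non-redundant boxes survive as final detections. A small sketch using torchvision.ops.nms, an assumed stand-in for whatever NMS routine the released code actually uses:

    import torch
    from torchvision.ops import nms

    NMS_THRESH_AGGREGATION = 0.75  # rough feature selection (training)
    NMS_THRESH_FINAL = 0.5         # producing final detection boxes

    def keep_candidates(boxes, scores, final=False):
        # boxes: Tensor[N, 4] in (x1, y1, x2, y2); scores: Tensor[N].
        thresh = NMS_THRESH_FINAL if final else NMS_THRESH_AGGREGATION
        keep = nms(boxes, scores, thresh)
        return boxes[keep], scores[keep]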