Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification

Authors: Xingze Li, Wengang Zhou, Yun Zhou, Houqiang Li

AAAI 2020, pp. 11434-11441 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on four prevalent benchmarks verify the state-of-the-art performance of the proposed method."
Researcher Affiliation | Academia | "Xingze Li, Wengang Zhou, Yun Zhou, Houqiang Li; CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China; lixingze@mail.ustc.edu.cn, {zhwg, zhouyun, lihq}@ustc.edu.cn"
Pseudocode | No | The paper describes its methods with diagrams and equations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include any explicit statement about releasing code or a link to a source code repository.
Open Datasets | Yes | "MARS (Zheng et al. 2016) is one of the largest video-based person re-identification benchmarks... DukeMTMC-VideoReID (Wu et al. 2018b) is another large video-based person re-identification benchmark... iLIDS-VID (Wang et al. 2014) and PRID-2011 (Hirzer et al. 2011) are two small benchmarks."
Dataset Splits | Yes | "For the MARS and DukeMTMC-VideoReID datasets, we adopt the widely used training/testing splits provided by (Zheng et al. 2016) and (Wu et al. 2018b). For the iLIDS-VID and PRID-2011 datasets, we randomly split the identities equally into the training set and testing set." (A minimal sketch of this identity split appears after the table.)
Hardware Specification | Yes | "Our model is implemented by PyTorch and optimized using four NVIDIA Tesla V100 GPUs."
Software Dependencies | No | The paper names PyTorch as the implementation framework but gives no version numbers or other library dependencies: "Our model is implemented by PyTorch and optimized using four NVIDIA Tesla V100 GPUs."
Experiment Setup | Yes | "In the training phase, we randomly select T frames from a variable-length sequence to form a fixed-length input clip. Each batch consists of P identities and K input clips for each identity. In all our experiments, we select P = 18 and K = 4; therefore, the batch size is 72T. All images are resized to 256 × 128 and randomly horizontally flipped. Random erasing (Zhong et al. 2017) is also used as data augmentation. We use the ResNet-50 (He et al. 2016) pretrained on the ImageNet (Deng et al. 2009) dataset as the backbone network. The last pooling layer and fully connected layer are removed, and the stride of the last down-sampling in the conv5_x block is set to 1. The model is optimized using Adam (Kingma and Ba 2014) with weight decay 5 × 10⁻⁴. The initial learning rate is 3 × 10⁻⁴, and it is reduced to 3 × 10⁻⁵ and 3 × 10⁻⁶ after 125 and 250 training epochs. The model is trained for 375 epochs in total." (A hedged training-setup sketch appears after the table.)
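For the Dataset Splits row: the paper states only that the iLIDS-VID and PRID-2011 identities are split equally at random. Below is a minimal sketch of such a split; the function name split_identities_equally, the fixed seed, and the use of Python's random module are illustrative assumptions, not the authors' code.

```python
import random

def split_identities_equally(identity_ids, seed=0):
    # Hypothetical helper: shuffle all person IDs and divide them 50/50
    # into training and testing identity sets, as described for the
    # iLIDS-VID and PRID-2011 experiments.
    ids = list(identity_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])

# Example: iLIDS-VID has 300 identities, giving the usual 150/150 split.
train_ids, test_ids = split_identities_equally(range(300))
```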
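For the Experiment Setup row: the quoted recipe (ImageNet-pretrained ResNet-50 with the last down-sampling stride set to 1, Adam with weight decay 5 × 10⁻⁴, and a learning rate decayed tenfold from 3 × 10⁻⁴ after epochs 125 and 250) can be sketched in PyTorch as follows. This is a hedged illustration assuming torchvision's ResNet implementation; the helper name build_backbone and the training-loop placeholder are not from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_backbone():
    # Hypothetical helper: ImageNet-pretrained ResNet-50 with the final
    # pooling and fully connected layers removed.
    net = resnet50(pretrained=True)
    # Change the stride of the last down-sampling (the first conv5_x /
    # layer4 block) from 2 to 1, so a 256x128 input yields a 16x8
    # feature map instead of 8x4.
    net.layer4[0].conv2.stride = (1, 1)
    net.layer4[0].downsample[0].stride = (1, 1)
    # Keep only the convolutional trunk (drop avgpool and fc).
    return nn.Sequential(*list(net.children())[:-2])

model = build_backbone()

# Adam with weight decay 5e-4; lr starts at 3e-4 and is cut 10x after
# epochs 125 and 250, matching the paper's 375-epoch schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[125, 250], gamma=0.1)

for epoch in range(375):
    # ... one epoch over batches of P x K = 18 x 4 clips of T frames each ...
    scheduler.step()
```

Setting the last stride to 1 is a common re-identification trick: it doubles the spatial resolution of the final feature map at no extra parameter cost, which is generally done so attention modules can operate on finer spatial detail.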