Representing Sets of Instances for Visual Recognition

Authors: Jianxin Wu, Bin-Bin Gao, Guoqing Liu

AAAI 2016

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | D3 is evaluated in action and image recognition tasks. It achieves excellent robustness, accuracy, and speed.

Researcher Affiliation | Collaboration | Jianxin Wu (1), Bin-Bin Gao (1), Guoqing Liu (2). (1) National Key Laboratory for Novel Software Technology, Nanjing University, China (wujx2001@nju.edu.cn, gaobb@lamda.nju.edu.cn); (2) Minieye, Youjia Innovation LLC, China (guoqing@minieye.cc)

Pseudocode | Yes | Algorithm 1: Visual representation using D3

Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository.

Open Datasets | Yes | UCF 101 (Soomro, Zamir, and Shah 2012), HMDB 51 (Kuehne et al. 2011), and Youtube (Liu, Luo, and Shah 2009). For UCF 101, the three splits of train and test videos in (Jiang et al. 2013) are used... Scene 15 (Lazebnik, Schmid, and Ponce 2006)... MIT indoor 67 (Quattoni and Torralba 2009)... Caltech 101 (Fei-Fei, Fergus, and Perona 2004)... Caltech 256 (Griffin, Holub, and Perona 2007)... SUN 397 (Xiao et al. 2010).

Dataset Splits | Yes | For UCF 101, the three splits of train and test videos in (Jiang et al. 2013) are used and the average accuracy is reported. Scene 15: 100 training images per category, the rest for testing. MIT indoor 67: the train/test split provided in (Quattoni and Torralba 2009). Caltech 101: train on 30 and test on 50 images per category. Caltech 256: train on 60 images per category, the rest for testing. SUN 397: the first 3 train/test splits of (Xiao et al. 2010). Youtube: the average of the 25-fold leave-one-group-out cross-validation accuracy rates.

Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments, only general statements about computational efficiency.
Software Dependencies | No | The paper mentions general software components such as a 'linear SVM classifier' and the 'k-means algorithm', and states that the CNN model used is 'imagenet-vgg-verydeep16', but it does not specify any software dependencies with version numbers.
Experiment Setup | Yes | The CNN model used is imagenet-vgg-verydeep16 in (Simonyan and Zisserman 2015) up to the last convolutional layer, and the input image is resized such that its shortest edge is not smaller than 314 pixels and its longest edge is not larger than 1120 pixels. Six spatial regions are used, corresponding to the level 1 and 0 regions in (Wu and Rehg 2011). (Gao et al. 2015) finds that FV/VLAD usually achieves optimal performance with very small K sizes in DSP; hence, K ∈ {4, 8} is tested. A linear SVM classifier is used, and improved trajectory features (ITF) with default parameters.
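The resizing rule quoted in the setup (shortest edge not smaller than 314 pixels, longest edge not larger than 1120 pixels) can be sketched as below. This is an illustrative reconstruction, not code from the paper: the function name and the tie-breaking choice when the two constraints conflict (capping any upscale by the long-edge limit) are assumptions.

```python
def resize_dims(width, height, min_short=314, max_long=1120):
    """Return (width, height) scaled so the shortest edge is at least
    min_short pixels and the longest edge is at most max_long pixels.

    If both constraints cannot hold at once, the long-edge limit wins
    (an assumed tie-break; the paper does not specify one).
    """
    # Upscale only if the short edge is below the minimum.
    scale = max(1.0, min_short / min(width, height))
    # Cap the scale so the long edge never exceeds max_long.
    scale = min(scale, max_long / max(width, height))
    return round(width * scale), round(height * scale)
```

For example, a 200x100 image is upscaled by 3.14 to 628x314 to satisfy the short-edge minimum, while a 3000x1000 image is downscaled to 1120x373 because the long-edge cap dominates.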