Unsupervised Deep Learning of Mid-Level Video Representation for Action Recognition

Authors: Jingyi Hou, Xinxiao Wu, Jin Chen, Jiebo Luo, Yunde Jia

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the HMDB51 and the UCF101 datasets demonstrate the superiority of the proposed method, even over several supervised learning methods.
Researcher Affiliation | Academia | 1. Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, China; 2. Department of Computer Science, University of Rochester, Rochester, NY 14627, USA
Pseudocode | Yes | Algorithm 1: Iterative clustering algorithm for deep video representation.
Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | Extensive experiments are conducted on the HMDB51 (Kuehne et al. 2011) and the UCF101 (Soomro, Zamir, and Shah 2012) datasets to evaluate the performance of our method.
Dataset Splits | Yes | We follow the standard evaluation protocols of the two datasets provided in (Kuehne et al. 2011) and (Soomro, Zamir, and Shah 2012) to calculate the average accuracy over the three splits into training and test data. To prevent over-fitting, we split the input video volume set V into M non-overlapping subsets {V_1, ..., V_M} for cross-validation. The number of cross-validation sets is m = 10.
Hardware Specification | Yes | The training of the deep networks is implemented on a single NVIDIA TITAN X GPU with 12 GB of memory.
Software Dependencies | No | The paper mentions general components like 'deep neural networks' and 'ReLU activations' but does not specify any particular software libraries, frameworks, or their version numbers used for implementation.
Experiment Setup | Yes | The input IT features... dimension of each IT feature is 396. The centers of the input volumes are sampled at the middle of the trajectories, and the size of each volume is 16 × 16 × 12. The number of cross-validation sets is m = 10. The number of the topmost samples selected from each cluster during iteration is p = 100. The initial number of clusters is set to 600 for the fully connected autoencoder and 1,000 for the fully convolutional autoencoder, and after optimizing, it comes to q = 499 and 762, respectively. The dimension of the output of the proposed networks is set to 198 and 512. As for the linear SVMs, the value of the penalty parameter is chosen among 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3.
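
The Pseudocode row refers to Algorithm 1, the paper's iterative clustering procedure for learning the deep video representation. The exact objective and the cluster-merging rule are not reproduced on this page, so the following is only a minimal sketch of the general idea: alternate between clustering the current codes, keeping the p samples nearest each cluster center as reliable pseudo-labeled data, and retraining the network on that subset. The k-means choice, the plain reconstruction loss, and all names are assumptions, not the authors' implementation.

```python
# Minimal sketch of an iterative clustering loop in the spirit of Algorithm 1.
# The architecture, loss, and all names are assumptions; the paper's
# cluster-merging step (600 -> q = 499 clusters) is omitted here.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class FCAutoencoder(nn.Module):
    """Fully connected autoencoder: 396-d IT features -> 198-d code (dims from the paper)."""
    def __init__(self, in_dim=396, code_dim=198):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def iterative_clustering(features, n_clusters=600, p=100, rounds=5, epochs=20):
    x = torch.as_tensor(features, dtype=torch.float32)
    model = FCAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(rounds):
        with torch.no_grad():
            codes = model.encoder(x).numpy()          # current representation
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(codes)
        dists = km.transform(codes)                   # sample-to-center distances
        # Keep the p samples nearest each center as reliable pseudo-labeled data.
        keep = np.unique(np.concatenate(
            [np.argsort(dists[:, c])[:p] for c in range(n_clusters)]))
        sub = x[torch.as_tensor(keep)]
        for _ in range(epochs):                       # retrain on the selected subset
            opt.zero_grad()
            _, recon = model(sub)
            loss = nn.functional.mse_loss(recon, sub)
            loss.backward()
            opt.step()
    return model
```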
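
The Dataset Splits row quotes two different protocols: the standard three train/test splits of HMDB51 and UCF101, and the paper's own partition of the input volume set V into m = 10 non-overlapping subsets for cross-validation. A minimal sketch of the latter partitioning, with hypothetical names:

```python
import numpy as np

def split_volumes(n_volumes, m=10, seed=0):
    """Partition volume indices into m non-overlapping subsets V_1, ..., V_M."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_volumes), m)

subsets = split_volumes(100_000, m=10)
assert sum(len(s) for s in subsets) == 100_000  # disjoint cover of all volumes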
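
The Experiment Setup row notes that the linear SVM penalty parameter is chosen from {10^-3, ..., 10^3}. The paper does not name an SVM implementation, so the scikit-learn grid search below is an assumption; the 512-d features and 5 action classes are placeholder stand-ins:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Penalty grid from the paper: 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3.
param_grid = {"C": [10.0 ** k for k in range(-3, 4)]}

X = np.random.randn(200, 512)        # stand-in for 512-d video representations
y = np.random.randint(0, 5, 200)     # stand-in for action labels
search = GridSearchCV(LinearSVC(max_iter=5000), param_grid, cv=3).fit(X, y)
print(search.best_params_)
```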