EXMOVES: Classifier-based Features for Scalable Action Recognition

Authors: Du Tran; Lorenzo Torresani

ICLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the generality of our approach by building our mid-level descriptors from two different low-level feature vectors. The accuracy and efficiency of the approach are demonstrated on several large-scale action recognition benchmarks. (Section 4: Experiments)
Researcher Affiliation | Academia | Du Tran, Lorenzo Torresani {DUTRAN,LORENZO}@CS.DARTMOUTH.EDU, Computer Science Department, Dartmouth College, NH 03755 USA
Pseudocode | Yes | Algorithm 1: EXMOVE training
Open Source Code | Yes | Additional material including software to extract EXMOVES from videos is available at http://vlg.cs.dartmouth.edu/exmoves.
Open Datasets | Yes | 1. HMDB51 (Kuehne et al., 2011): It consists of 6849 image sequences collected from movies as well as YouTube and Google videos. They represent 51 action categories. 2. Hollywood-2 (Marszalek et al., 2009): This dataset includes over 20 hours of video, subdivided into 3669 sequences, spanning 12 action classes. 3. UCF50: This dataset contains 6676 videos taken from YouTube for a total of 50 action categories. 4. UCF101 (Soomro et al.): UCF101 is a superset of UCF50.
Dataset Splits | Yes | The results for this dataset are presented using 3-fold cross validation on the 3 publicly available training/testing splits. ... We report the accuracy of 25-fold cross validation using the publicly available training/testing splits.
Hardware Specification | Yes | Runtimes were measured on a single-core Linux machine with a CPU @ 2.66GHz.
Software Dependencies | No | The paper mentions software components such as "exemplar-SVM", "linear SVM", and "k-means", but does not provide specific version numbers for these or other libraries/frameworks.
Experiment Setup | Yes | As in (Wang et al., 2013), we use a dictionary of 25,000 visual words for Dense Trajectories and 5,000 visual words for HOG-HOF-STIPs. ... The hyperparameter C of the SVM is tuned via cross-validation for all baselines, Action Bank, and our EXMOVES. ... The scales are 1, 0.75, 0.5, and Np = 73 space-time volumes obtained by recursive octree subdivision of the entire video using 3 levels ... In our implementation we use M = 10, but we find that in more than 85% of the cases, the learning procedure converges before reaching this maximum number of iterations. ... we add to the active set only the volumes that yield the largest violations in each video, for a maximum of k− = 3 per negative video and k+ = 10 for the positive video. (Hedged illustrative sketches of the exemplar training loop, the octree volume layout, and the bag-of-words encoding with C tuning follow the table.)
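
The quoted setup describes an exemplar-SVM trained per video with an active set grown by hard-example mining (at most M = 10 iterations, adding up to k− = 3 hard volumes per negative video and k+ = 10 from the positive video). The following is a minimal sketch of that loop, assuming precomputed per-volume feature vectors; the helper names (train_exmove, positive_volumes, negative_videos) are hypothetical placeholders and not the authors' released code.

```python
# Hedged sketch of an exemplar-SVM training loop with active-set hard-example
# mining, following the constants quoted in the Experiment Setup row.
# Feature extraction and dataset handling are outside the scope of this sketch.
import numpy as np
from sklearn.svm import LinearSVC

M = 10      # maximum number of mining iterations (paper: M = 10)
K_NEG = 3   # hardest volumes added per negative video (paper: k- = 3)
K_POS = 10  # hardest volumes added from the positive video (paper: k+ = 10)

def train_exmove(positive_volumes, negative_videos, C=1.0):
    """positive_volumes: (n_pos, d) features from the exemplar video.
    negative_videos: list of (n_i, d) feature matrices, one per negative video."""
    # Seed the active set with a few volumes from each side.
    pos_active = positive_volumes[:K_POS]
    neg_active = np.vstack([v[:K_NEG] for v in negative_videos])

    clf = LinearSVC(C=C)
    for _ in range(M):
        X = np.vstack([pos_active, neg_active])
        y = np.hstack([np.ones(len(pos_active)), -np.ones(len(neg_active))])
        clf.fit(X, y)

        # Mine the most-violating volumes and grow the active set:
        # lowest-scoring positives and highest-scoring negatives.
        new_pos = positive_volumes[np.argsort(clf.decision_function(positive_volumes))[:K_POS]]
        new_neg = []
        for vid in negative_videos:
            scores = clf.decision_function(vid)
            new_neg.append(vid[np.argsort(-scores)[:K_NEG]])
        pos_active = np.vstack([pos_active, new_pos])
        neg_active = np.vstack([neg_active] + new_neg)
        # In practice one would stop early once no margin violations remain;
        # the paper reports convergence before M in over 85% of cases.
    return clf
```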
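
The quoted figure Np = 73 is consistent with a 3-level recursive octree subdivision of the whole video: 1 + 8 + 64 = 73 space-time volumes. The small sketch below only enumerates that volume layout in normalized coordinates; how classifier scores are pooled over these volumes is not reproduced here.

```python
# Hedged sketch of the 3-level octree volume layout implied by Np = 73.
def octree_volumes(levels=3):
    """Return (x0, x1, y0, y1, t0, t1) boxes in normalized [0, 1] video coordinates."""
    boxes = []
    for level in range(levels):
        n = 2 ** level                 # 1, 2, 4 splits per axis -> 1, 8, 64 boxes
        step = 1.0 / n
        for ix in range(n):
            for iy in range(n):
                for it in range(n):
                    boxes.append((ix * step, (ix + 1) * step,
                                  iy * step, (iy + 1) * step,
                                  it * step, (it + 1) * step))
    return boxes

assert len(octree_volumes()) == 73     # 1 + 8 + 64
```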
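
Finally, the setup quote mentions k-means vocabularies (25,000 words for Dense Trajectories, 5,000 for HOG-HOF-STIPs) and a linear SVM whose C is tuned via cross-validation. The sketch below illustrates that generic bag-of-words encoding and C selection under those assumptions; descriptor extraction and the specific cross-validation folds are not reproduced, and all function names are illustrative rather than the authors' pipeline.

```python
# Hedged sketch: k-means bag-of-words quantization plus cross-validated C selection
# for a linear SVM, matching the vocabulary sizes quoted in the Experiment Setup row.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def build_vocabulary(descriptors, n_words=5000):
    """descriptors: (n, d) stack of low-level descriptors sampled from training videos."""
    km = MiniBatchKMeans(n_clusters=n_words, batch_size=10000, n_init=3)
    km.fit(descriptors)
    return km

def encode_video(descriptors, km):
    """Hard-assign each descriptor to its nearest visual word; L1-normalize the histogram."""
    words = km.predict(descriptors)
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def tune_and_train(X_train, y_train):
    """Select the SVM regularizer C by cross-validation, as the paper does for all methods."""
    grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```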