Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos

Authors: Mohamed Elhoseiny, Jingen Liu, Hui Cheng, Harpreet Sawhney, Ahmed Elgammal

AAAI 2016

Reproducibility Variable Result LLM Response
Research Type Experimental We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state-of-the-art that uses big descriptions from 12.6% to 13.5% with MAP metric and 0.73 to 0.83 with ROC-AUC metric.
Researcher Affiliation Collaboration Mohamed Elhoseiny, Jingen Liu, Hui Cheng, Harpreet Sawhney, Ahmed Elgammal; m.elhoseiny@cs.rutgers.edu, {jingen.liu,hui.cheng,harpreet.sawhney}@sri.com, elgammal@cs.rutgers.edu; Rutgers University, Computer Science Department; SRI International, Vision and Learning Group
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Supplementary Materials (SM) could be found here https://sites.google.com/site/mhelhoseiny/EDiSE_supp.zip
Open Datasets Yes We evaluated our method on the large TRECVID MED (Felzenszwalb, McAllester, and Ramanan 2013).
Dataset Splits No The paper mentions evaluating on the 'MEDTest set' of TRECVID MED, but does not explicitly provide details about specific training/validation/test splits for their experiments on this dataset (e.g., percentages, sample counts, or explicit standard split citations for all three partitions).
Hardware Specification Yes it takes 270 seconds on a 16 cores Intel Xeon processor (64GB RAM) to the retrieval task on 20 events altogether.
Software Dependencies No The paper mentions various models and tools used (e.g., Mikolov et al. 2013b for word embedding, Overfeat, SIFT, HOG), but it does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup Yes In practice, we only include $\theta_v(c_i)$ in $\psi(v_c)$ such that $c_i$ is among the top R concepts with highest $p(e_c|c_i)$. This is assuming that the remaining concepts are assigned $p(e_c|c_i) = 0$ which makes those items vanish; we used R=5. We fuse $p(e_c|v)$, $p(e_o|v)$, and $p(e_a|v)$ by weighted geometric mean with focus on visual concepts, i.e. $p(e|v) = \sqrt[w+1]{p(e_c|v)^w \sqrt{p(e_o|v)\,p(e_a|v)}}$; w = 6. M = 250. l = 50% (i.e., median). M = 300.
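
The two setup steps quoted in the Experiment Setup row (keeping only the top R concepts, then fusing the three modality scores by a weighted geometric mean) can be illustrated with a minimal sketch. This is not the authors' code. The function names `keep_top_r` and `fuse_scores` are hypothetical, and the fusion formula is an assumption read off the quoted text: the concept score is raised to the power w, multiplied by the square root of the product of the other two scores, and the (w+1)-th root is taken.

```python
import numpy as np


def keep_top_r(concept_relevance, r=5):
    """Zero out every concept relevance except the r largest ones.

    Mirrors the quoted setup, where concepts outside the top R are
    treated as having p(e_c|c_i) = 0 (R = 5 in the paper).
    """
    relevance = np.asarray(concept_relevance, dtype=float)
    masked = np.zeros_like(relevance)
    if r <= 0:
        return masked
    top = np.argsort(relevance)[-r:]  # indices of the r largest values
    masked[top] = relevance[top]
    return masked


def fuse_scores(p_c, p_o, p_a, w=6.0):
    """Weighted geometric mean that emphasises the visual-concept score.

    Assumed form: (p_c**w * sqrt(p_o * p_a)) ** (1 / (w + 1)), with w = 6
    as quoted. Inputs may be scalars or arrays of per-video scores.
    """
    p_c, p_o, p_a = (np.asarray(x, dtype=float) for x in (p_c, p_o, p_a))
    return (p_c ** w * np.sqrt(p_o * p_a)) ** (1.0 / (w + 1.0))
```

Under this assumed form the exponents sum to one, so three equal inputs fuse to that same value, and with w = 6 the concept score carries six sevenths of the total weight.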