Unsupervised Alignment of Actions in Video with Text Descriptions
Authors: Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo, Daniel Gildea, Henry Kautz
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section describes the evaluation of hyperfeature construction and alignment of actions on two multimodal datasets with parallel video and text. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Rochester, Rochester, NY, USA; (2) Indian Institute of Technology Delhi, New Delhi, India |
| Pseudocode | Yes | Algorithm 1 describes this process in detail. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | The Wetlab dataset [Naim et al., 2014; 2015], the TACoS corpus [Regneri et al., 2013]; We evaluate our system on action features generated by CNN models trained using the UCF101 action recognition dataset [Soomro et al., 2012]. |
| Dataset Splits | No | The paper evaluates on datasets like Wetlab and TACoS, stating that ground truth segmentation is used for evaluation in the latter. However, it does not specify explicit training, validation, and testing splits (e.g., percentages or counts) for model reproduction. |
| Hardware Specification | Yes | Each iteration per video took an average of 6.6 seconds on a single core of a 2.4GHz Intel Xeon processor with 32GB of RAM. |
| Software Dependencies | No | The paper mentions using a 'two-stage Charniak-Johnson parser', a 'Kalman filter', and a 'modified version of the SLIC superpixel algorithm', but does not provide specific version numbers for these or any other software components used in the experiments. |
| Experiment Setup | Yes | For hyperfeature variables {d^(1), w, d^(2)}, we achieved best results using {64, 150, 32} for STIP, {128, 150, 32} for dense trajectory, and {128, 150, 64} for CNN features. For all the variations, we train LCRF models by running 200 iterations over the entire dataset. |
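The Experiment Setup row reports per-feature hyperfeature settings and the LCRF training budget. As a minimal illustrative sketch (not code from the paper; names such as `HYPERFEATURE_CONFIG` and `TRAINING_ITERATIONS` are hypothetical), the reported values could be organized as a configuration like this:

```python
# Hypothetical configuration sketch of the hyperfeature settings reported in the paper.
# d1: first-layer codebook size, w: temporal window size, d2: second-layer codebook size.
HYPERFEATURE_CONFIG = {
    "stip":             {"d1": 64,  "w": 150, "d2": 32},
    "dense_trajectory": {"d1": 128, "w": 150, "d2": 32},
    "cnn":              {"d1": 128, "w": 150, "d2": 64},
}

# LCRF models were trained for 200 iterations over the entire dataset.
TRAINING_ITERATIONS = 200

if __name__ == "__main__":
    for feature, params in HYPERFEATURE_CONFIG.items():
        print(f"{feature}: d1={params['d1']}, w={params['w']}, d2={params['d2']}")
```

Laid out this way, it is easy to see that only the codebook sizes d^(1) and d^(2) differ across STIP, dense trajectory, and CNN features, while the window size w = 150 is shared by all three.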