Unsupervised Alignment of Natural Language Instructions with Video Segments

Authors: Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry Kautz, Jiebo Luo, Daniel Gildea

AAAI 2014

Reproducibility Variables, Results, and LLM Responses

Research Type: Experimental
LLM Response: We evaluate our algorithm on videos of biological experiments performed in wetlabs, and demonstrate its capability of aligning video segments to text instructions and matching video objects to nouns in the absence of any direct supervision. We perform experiments on six wetlab videos (three protocols, two videos per protocol). The alignment results (Table 2) show that our algorithm outperforms the uniform baseline, both on Anvil annotations and on the output of the computer vision system.

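The "uniform baseline" mentioned above is not defined in this excerpt. As a reading aid, here is a minimal Python sketch of one plausible interpretation, in which each instruction is assumed to occupy an equal, contiguous span of the video; the function and variable names are illustrative, not from the paper.

```python
def uniform_alignment(num_segments: int, num_instructions: int) -> list[int]:
    """Map each video segment index to an instruction index, assuming
    instructions occupy equal, contiguous shares of the timeline."""
    return [
        min(seg * num_instructions // num_segments, num_instructions - 1)
        for seg in range(num_segments)
    ]

# Example: 10 video segments spread uniformly over 4 instructions.
print(uniform_alignment(10, 4))  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
```
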
Researcher Affiliation: Academia
LLM Response: Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry Kautz, Jiebo Luo, Daniel Gildea; Department of Computer Science, University of Rochester, Rochester, NY 14627.

Pseudocode: No
LLM Response: The paper describes the EM algorithm and models but does not contain a structured pseudocode block or algorithm figure.

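Since the paper describes its EM procedure only in prose, the following sketch shows the general shape of an IBM Model 1-style EM loop over noun-blob co-occurrences, which is the family of model the paper describes. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: all names are hypothetical, and the sketch takes the sentence-to-segment alignment as given, whereas the paper's model also infers that alignment.

```python
from collections import defaultdict

def em_noun_blob(pairs, nouns, blobs, iterations=20):
    """pairs: list of (sentence_nouns, segment_blobs) for aligned
    sentence/segment pairs. Returns p(noun | blob) learned by EM."""
    # Uniform initialization, as the paper describes.
    p = {(n, b): 1.0 / len(nouns) for n in nouns for b in blobs}
    for _ in range(iterations):
        count = defaultdict(float)   # expected noun-blob counts
        total = defaultdict(float)   # expected counts per blob
        # E-step: fractionally assign each noun to its co-occurring blobs.
        for sent_nouns, seg_blobs in pairs:
            for n in sent_nouns:
                z = sum(p[(n, b)] for b in seg_blobs)
                for b in seg_blobs:
                    frac = p[(n, b)] / z
                    count[(n, b)] += frac
                    total[b] += frac
        # M-step: renormalize the expected counts into probabilities.
        for n in nouns:
            for b in blobs:
                if total[b] > 0:
                    p[(n, b)] = count[(n, b)] / total[b]
    return p

# Toy usage: two aligned sentence/segment pairs.
pairs = [(["tube", "pipette"], ["blob1", "blob2"]),
         (["tube"], ["blob1"])]
probs = em_noun_blob(pairs, ["tube", "pipette"], ["blob1", "blob2"])
print(round(probs[("tube", "blob1")], 3))  # mass shifts toward tube-blob1
```
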
Open Source Code: No
LLM Response: The paper does not provide any concrete access to source code (e.g., a repository link or an explicit statement about code release).

Open Datasets: No
LLM Response: Our wetlab dataset has three different protocols: Cellobiose M9 Media (CELL), LB Liquid Growth Media (LLGM), and Yeast YPAD Media (YPAD). Each protocol consists of a sequence of instructions. ... We manually annotate each of the videos to specify the objects touched by the hands using the video annotation tool Anvil (Kipp 2012). There is no explicit statement or link indicating this dataset is publicly available.

Dataset Splits: No
LLM Response: We perform experiments on six wetlab videos (three protocols, two videos per protocol). To compare the errors introduced by our alignment algorithm and automated video segmentation and tracking systems, we evaluate alignment and matching accuracy both using automatically segmented videos and hand annotated videos. No explicit train/validation/test splits are mentioned.

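The alignment accuracy referred to above is not defined in this excerpt. A minimal version, assumed here purely for illustration, scores the fraction of video segments mapped to their ground-truth instruction:

```python
def alignment_accuracy(predicted, gold):
    """predicted, gold: lists mapping segment index -> instruction index.
    A hypothetical metric, not necessarily the one reported in Table 2."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

print(alignment_accuracy([0, 0, 1, 2], [0, 1, 1, 2]))  # 0.75
```
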
Hardware Specification: No
LLM Response: The paper mentions an 'HD video camera and an ASUS Xtion Pro RGB-Depth sensor' used for data capture, but it gives no hardware details (CPU or GPU models, memory, etc.) for the machines that ran the computational experiments.

Software Dependencies: No
LLM Response: The paper names specific tools and algorithms, such as the 'Charniak-Johnson syntactic parser', the 'SLIC superpixel algorithm', a '3D Kalman filter', and 'Anvil', but provides no version numbers for them or for any other software dependencies.

Experiment Setup: No
LLM Response: The paper describes the EM algorithm steps (initialization, E-step, M-step) and the uniform initialization, but does not provide specific numerical hyperparameters, such as the number of EM iterations or a convergence threshold.

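In the absence of reported iteration counts, a reimplementation would have to pick its own stopping rule. One common convention, shown here purely as an assumption, is to run EM until the log-likelihood improvement drops below a tolerance or an iteration cap is reached; the callables and default values below are illustrative placeholders for a concrete model's routines.

```python
import math

def run_em_until_converged(e_step, m_step, log_likelihood,
                           tol=1e-4, max_iters=100):
    """e_step/m_step: callables performing one EM half-step;
    log_likelihood: callable returning the current data log-likelihood."""
    prev_ll = -math.inf
    for it in range(max_iters):
        e_step()
        m_step()
        ll = log_likelihood()
        if ll - prev_ll < tol:
            return it + 1, ll   # converged
        prev_ll = ll
    return max_iters, prev_ll   # hit the iteration cap
```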