Self-View Grounding Given a Narrated 360° Video

Authors: Shih-Han Chou, Yi-Chun Chen, Kuo-Hao Zeng, Hou-Ning Hu, Jianlong Fu, Min Sun

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our method, we collect the first narrated 360° videos dataset and achieve state-of-the-art NFoV-grounding performance. ... Experiments: Because the style of indoor videos and outdoor videos are different on both vision and subtitles, we first conduct the ablation studies of our proposed method and compare our model with baselines in the beginning. ... Finally, we show the results and make a brief discussion.
Researcher Affiliation | Collaboration | Shih-Han Chou, Yi-Chun Chen, Kuo-Hao Zeng, Hou-Ning Hu, Jianlong Fu, Min Sun; Department of Electrical Engineering, National Tsing Hua University; Microsoft Research, Beijing, China; {happy810705, yichun8447}@gmail.com, khzeng@cs.stanford.edu, {eborboihuc@gapp, sunmin@ee}.nthu.edu.tw, jianf@microsoft.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'We implement all of our methods by PyTorch (Paszke and Chintala)', but it does not provide a specific link or an explicit statement about releasing the source code for the described methodology.
Open Datasets | Yes | To evaluate our method, we collect the first narrated 360° videos dataset. This dataset consists of touring videos, including scenic spots and housing introductions, and subtitle files, including subtitle text and start and end timecodes. ... (Available at http://aliensunmin.github.io/project/360grounding/) ... We use ResNet-101 pre-trained on ImageNet (Deng et al. 2009) as our visual encoder, and we pre-train our language decoder on the MSCOCO dataset (Lin et al. 2014b).
Dataset Splits | Yes | We assign 80% of the videos and subtitles for training and 10% each for validation and testing. (A minimal split sketch appears after the table.)
Hardware Specification | Yes | We conduct all experiments on a single computer with an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz, 64GB DDR3 RAM, and an NVIDIA Titan X GPU.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number. It also references models such as ResNet-101, GRU, and LSTM, but these are architectural components, not software dependencies with version numbers for reproducibility.
Experiment Setup | Yes | We set λ = 0.8. ... We decrease the frame rate to 1 to save memory usage and set the dictionary dimension to 9956 according to the number of words appearing in all subtitles. We randomly sample 3 consecutive frames during the training phase (i.e., k = 3) ... Since the maximal length of subtitles is 33, we set m = 33 ... We use Adam (Kingma and Ba 2015) as optimizer with default hyperparameters and a 0.001 learning rate, and set the batch size B to 4. (A hedged configuration sketch appears after the table.)
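
The 80/10/10 split referenced in the Dataset Splits row is simple enough to illustrate. Below is a minimal sketch, assuming each video is paired with its subtitle file in a single list; the paper does not describe the exact split procedure (shuffling, random seed, or any indoor/outdoor stratification), so the function name, its arguments, and the deterministic shuffle are assumptions.

```python
# Minimal sketch of an 80% / 10% / 10% split over paired (video, subtitles) items.
# Only the split ratios come from the paper; everything else is assumed.
import random

def split_dataset(video_subtitle_pairs, seed=0):
    items = list(video_subtitle_pairs)
    random.Random(seed).shuffle(items)      # deterministic shuffle (assumption)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remaining ~10% for testing
    return train, val, test
```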
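
As a companion to the Experiment Setup row, the sketch below collects the quoted hyperparameters into a single PyTorch configuration. It is a sketch under stated assumptions, not the authors' implementation: the model body is a placeholder, the frame sampler and data loader that would realize k = 3 and batch size 4 are omitted, and only the ResNet-101/ImageNet encoder, λ = 0.8, k = 3, m = 33, the 9956-word dictionary, Adam with a 0.001 learning rate, and batch size 4 are taken from the paper.

```python
# Hedged sketch of the reported training configuration. The model body is a
# placeholder; only the quoted values and the ResNet-101 visual encoder
# pre-trained on ImageNet come from the paper.
import torch
import torchvision

config = {
    "lambda": 0.8,       # loss-mixing weight λ reported in the paper
    "k": 3,              # consecutive frames sampled per training step
    "m": 33,             # maximal subtitle length in words
    "vocab_size": 9956,  # dictionary dimension over all subtitle words
    "batch_size": 4,     # batch size B
    "lr": 1e-3,          # Adam learning rate (default remaining hyperparameters)
}

# Visual encoder: ResNet-101 pre-trained on ImageNet, as stated in the paper.
visual_encoder = torchvision.models.resnet101(pretrained=True)
visual_encoder.fc = torch.nn.Identity()  # expose the pooled feature vector (assumption)

# Placeholder for the full NFoV-grounding model; the paper's architecture
# (visual/language encoders plus grounding decoder) is not reconstructed here.
model = torch.nn.Sequential(visual_encoder)

optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
```

Feature dimensionality, how the encoder output feeds the grounding decoder, and the loss combined by λ are all described in the paper but left out of this sketch.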