Segment-Then-Rank: Non-Factoid Question Answering on Instructional Videos

Authors: Kyungjae Lee, Nan Duan, Lei Ji, Jason Li, Seung-won Hwang (pp. 8147-8154)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper is experimental: 'Experimental result demonstrates that our model achieves state-of-the-art performance.'
Researcher Affiliation | Collaboration | Kyungjae Lee (1), Nan Duan (2), Lei Ji (2,3), Jason Li (4), Seung-won Hwang (1). Affiliations: (1) Department of Computer Science, Yonsei University, Seoul, South Korea; (2) Microsoft Research Asia, Beijing, China; (3) University of Chinese Academy of Science, Beijing, China; (4) STCA Multimedia Group, Microsoft, Beijing, China
Pseudocode | No | The paper provides architectural descriptions and mathematical formulations for its models but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to an open-source detector used by the authors ('Using opensource detector 1, we extract object categories c from images in clip T_k,' with footnote '1 https://github.com/peteanderson80/bottom-up-attention'), but does not state that the code for the authors' own proposed methodology ('Segmenter-Ranker') is publicly available. (A hedged sketch of this frame-level object extraction appears after the table.)
Open Datasets | No | The paper states, 'For training and evaluating this task, we collect labelled resources of 37K QA pairs and 21K video (total 1,662 hours)' and 'For such purpose, we contribute a labeled dataset of 37K QA pairs on instructional videos for benchmarking,' but does not provide a specific link, DOI, or repository for public access to this dataset.
Dataset Splits | Yes | 'We divide the dataset into 29K/4K/4K as training/dev/test set respectively, where the videos do not overlap in each set.' (A sketch of such a video-level split appears after the table.)
Hardware Specification | No | The paper discusses computational expense related to certain features (e.g., ResNet-50/101 features having large dimensions) but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run its own experiments.
Software Dependencies | No | The paper mentions using a 'base version of BERT (Devlin et al. 2018) with 12 layers' and the 'Adam optimizer', but it does not specify version numbers for software libraries (e.g., TensorFlow, PyTorch) or the exact BERT checkpoint used, which would be needed for a reproducible dependency description.
Experiment Setup | Yes | We use a base version of BERT (Devlin et al. 2018) with 12 layers as our encoder, following its default setting. We train our model on BERT for 3 epochs, and use the Adam optimizer with a learning rate of 0.00005. In Segmenter, we extract N = 9 span candidates from the output probabilities. In Ranker, the training data has a 1:9 positive-to-negative ratio, and this module ranks the top 9 candidates at inference time. For the CNN layer, the number of layers l_f is 30, and the top n_t = 7 elements in max-pooling are extracted; both are optimized on the dev set. For detecting image objects, we sample frames at 1 fps. (A configuration sketch reflecting these hyperparameters appears after the table.)
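
For the object-category extraction quoted under Open Source Code, the referenced bottom-up-attention detector is released by its authors, but the surrounding pipeline is not. Below is a minimal sketch of the 1-fps frame sampling and per-frame detection loop, assuming OpenCV for video decoding and treating the detector as an opaque callable; both are assumptions, since the paper does not describe its pipeline code.

```python
import cv2


def sample_frames(video_path, fps=1.0):
    """Yield frames from a video at roughly `fps` frames per second,
    mirroring the paper's 1-fps sampling for object detection."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(native_fps / fps)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield frame
        idx += 1
    cap.release()


def extract_object_categories(video_path, detector):
    """Collect object categories seen across a clip. `detector` is a
    stand-in for the cited bottom-up-attention model (not reproduced
    here); it is assumed to map a frame to an iterable of labels."""
    categories = set()
    for frame in sample_frames(video_path):
        categories.update(detector(frame))
    return categories
```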
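
The Dataset Splits row describes a video-level partition in which no video contributes examples to more than one split. A minimal sketch of such a split, assuming each QA example is a dict with a `video_id` key (a hypothetical field name) and that the 29K/4K/4K sizes correspond roughly to an 80/10/10 ratio over videos:

```python
import random
from collections import defaultdict


def split_by_video(examples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Partition QA examples into train/dev/test so that all examples
    from the same video land in the same split (no video overlap)."""
    by_video = defaultdict(list)
    for ex in examples:
        by_video[ex["video_id"]].append(ex)

    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    n = len(video_ids)
    n_train = int(ratios[0] * n)
    n_dev = int(ratios[1] * n)
    splits = {
        "train": video_ids[:n_train],
        "dev": video_ids[n_train:n_train + n_dev],
        "test": video_ids[n_train + n_dev:],
    }
    return {name: [ex for vid in vids for ex in by_video[vid]]
            for name, vids in splits.items()}
```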
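
The Experiment Setup row maps onto a standard BERT fine-tuning configuration. A minimal sketch using PyTorch and the Hugging Face transformers library, wiring up the reported hyperparameters; the framework and the bert-base-uncased checkpoint are assumptions, since the paper names neither.

```python
import torch
from transformers import BertModel, BertTokenizer

# Hyperparameters as reported in the paper's setup section.
LEARNING_RATE = 5e-5      # Adam optimizer with a learning rate of 0.00005
NUM_EPOCHS = 3            # model trained for 3 epochs
NUM_SPAN_CANDIDATES = 9   # Segmenter extracts N = 9 span candidates
NEG_PER_POS = 9           # Ranker trained with a 1:9 positive-to-negative ratio

# 12-layer base BERT encoder; the exact checkpoint is an assumption.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

optimizer = torch.optim.Adam(encoder.parameters(), lr=LEARNING_RATE)

# Training-loop skeleton only: `train_loader` and the Segmenter/Ranker
# heads are placeholders, since the authors' code is not released.
# for epoch in range(NUM_EPOCHS):
#     for batch in train_loader:
#         ...
```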