Structured Co-reference Graph Attention for Video-grounded Dialogue

Authors: Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo

AAAI 2021, pp. 1789-1797

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The validity of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, challenging video-grounded dialogue benchmarks, and the TVQA dataset, a large-scale video QA benchmark. Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on both benchmarks, while an extensive ablation study and qualitative analysis reveal the performance gain and improved interpretability.
Researcher Affiliation | Academia | Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo, Korea Advanced Institute of Science and Technology (KAIST)
Pseudocode | No | The paper describes its method using textual descriptions and equations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | AVSD (Alamri et al. 2019a) is a widely used benchmark dataset for video-grounded dialogue, collected on the Charades (Sigurdsson et al. 2016) human-activity dataset. ... TVQA (Lei et al. 2018) is a large-scale benchmark dataset for multi-modal video question answering, which consists of multiple-choice QA pairs for short video clips and corresponding subtitles.
Dataset Splits | Yes | AVSD... It contains 7,659, 1,787, and 1,710 dialogues for training, validation, and test, respectively. ... TVQA... It contains 122,039, 15,252, and 7,623 QAs for training, validation, and test, respectively.
Hardware Specification | Yes | Our model is trained on an NVIDIA TITAN V (12 GB of memory) GPU with the Adam optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 10^-9.
Software Dependencies | No | The entire framework is implemented with PyTorch. The paper does not specify version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | The dimension of the hidden layer is set to d = 512, and the number of attention heads for the GAT and decoder is set to K = 8. Criterions for edge Est are set to τs = 0.4, τt = 0.2 for sparse local connection. For GNGAT, we set distances n = 1, 2, 3, 4, with 1, 1, 2, and 4 heads assigned to each distance, respectively. ... We adopt a learning rate strategy similar to (Vaswani et al. 2017), with a learning rate warm-up of 10,000 training steps, and train the model for up to 20 epochs. We use a batch size of 32 and a dropout rate of 0.3.
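The setup above cites a learning-rate strategy "similar to (Vaswani et al. 2017)" with a 10,000-step warm-up and hidden dimension d = 512. A minimal sketch of that schedule, assuming the standard inverse-square-root ("Noam") formula from the Transformer paper; the function name `noam_lr` is illustrative, not from the paper:

```python
def noam_lr(step, d_model=512, warmup=10000):
    """Transformer learning-rate schedule (Vaswani et al. 2017):
    linear warm-up for `warmup` steps, then inverse-square-root decay.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate rises roughly linearly during warm-up, peaks at step 10,000,
# and then decays as 1/sqrt(step).
early = noam_lr(1_000)    # still warming up
peak = noam_lr(10_000)    # maximum learning rate
late = noam_lr(100_000)   # decaying phase
```

In PyTorch this is typically wired into the Adam optimizer quoted earlier via `torch.optim.lr_scheduler.LambdaLR`, stepping the scheduler once per training batch.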