Structured Co-reference Graph Attention for Video-grounded Dialogue
Authors: Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo
AAAI 2021, pp. 1789-1797 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The validity of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, challenging video-grounded dialogue benchmarks, and the TVQA dataset, a large-scale video QA benchmark. Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on both benchmarks, while an extensive ablation study and qualitative analysis reveal the performance gain and improved interpretability. |
| Researcher Affiliation | Academia | Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo Korea Advanced Institute of Science and Technology (KAIST) |
| Pseudocode | No | The paper describes its method using textual descriptions and equations, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | AVSD (Alamri et al. 2019a) is a widely used benchmark dataset for video-grounded dialogue, which is collected on the Charades (Sigurdsson et al. 2016) human-activity dataset. ... TVQA (Lei et al. 2018) is a large-scale benchmark dataset for multi-modal video question answering, which consists of multiple-choice QA pairs for short video clips and corresponding subtitles. |
| Dataset Splits | Yes | AVSD... It contains 7,659, 1,787, 1,710 dialogues for training, validation and test, respectively. ... TVQA... It contains 122,039, 15,252, 7,623 QAs for training, validation and test, respectively. |
| Hardware Specification | Yes | Our model is trained on an NVIDIA TITAN V (12GB of memory) GPU with the Adam optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹. |
| Software Dependencies | No | The entire framework is implemented with PyTorch. The paper does not specify version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | The dimension of the hidden layer is set to d = 512, and the number of attention heads for GAT and the decoder is set to K = 8. Criteria for edge E_st are set to τs = 0.4, τt = 0.2 for sparse local connection. For GNGAT, we set distance n = 1, 2, 3, 4, and 1, 1, 2, 4 heads are assigned to each distance, respectively. ... We adopt a learning rate strategy similar to (Vaswani et al. 2017), set the learning rate warm-up to 10,000 training steps, and trained the model for up to 20 epochs. We select a batch size of 32 and a dropout rate of 0.3. |
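The quoted setup says the learning rate follows the warm-up schedule of (Vaswani et al. 2017) with 10,000 warm-up steps and hidden dimension d = 512. As a minimal sketch (the exact scaling constant the authors used is an assumption; this is the standard "Noam" formula from the Transformer paper), the schedule would look like:

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 10_000) -> float:
    """Transformer-style learning rate at training step `step` (step >= 1):
    lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5}).
    Rises linearly during warm-up, then decays as step^{-0.5}."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak rate is reached exactly at the warm-up boundary.
peak = noam_lr(10_000)
early = noam_lr(1_000)    # still climbing
late = noam_lr(100_000)   # decaying
```

With d = 512 and 10,000 warm-up steps the peak rate is about 4.4e-4, which is consistent with typical Transformer training configurations.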