Structure-Aware Multimodal Sequential Learning for Visual Dialog
Authors: Young-Jin Kim, Min-Jun Kim, Kyunghwan An, Jinwoo Ahn, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, Eun-Sol Kim
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For experiments, we achieved a new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET. |
| Researcher Affiliation | Collaboration | 1 Department of Artificial Intelligence Application, Hanyang University, South Korea... 3 KT Corporation |
| Pseudocode | No | The paper describes the algorithm and architecture but does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We mainly evaluate our algorithm on the most challenging visual dialog dataset, COMET (Kottur et al. 2022) and use two commonly used visual dialog datasets: VisDial v1.0 (VisDial) (Das et al. 2017), and MNIST Dialog (Seo et al. 2017). |
| Dataset Splits | Yes | VisDial: As presented in Table 5, our model exhibits remarkable results on overall metrics, compared to the baselines, on the VisDial v1.0 validation set. |
| Hardware Specification | Yes | It takes 5 hours for 20 epoch training with 64 batch size on a 4-A100 machine. |
| Software Dependencies | No | The paper mentions pretrained models like ViT-base and Flan-T5-base, and the AdamW optimizer, but does not provide specific version numbers for software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | It takes 5 hours for 20 epoch training with 64 batch size on a 4-A100 machine. [...] We use the AdamW optimizer (Loshchilov and Hutter 2017) with β1 = 0.9, β2 = 0.98, and weight decay of 0.05. We use a piecewise linear scheduler with a linear warmup of 2K steps starting from a learning rate of 1e-4 and a peak learning rate of 1e-3. (A sketch of this configuration appears below the table.) |
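
Since no code is released, the following is a minimal PyTorch sketch of the reported optimizer and schedule only. PyTorch itself, the placeholder model, and `total_steps` are assumptions (the paper names no framework and reports 20 epochs rather than a step count); the betas, weight decay, warmup length, and learning rates are the values quoted above.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-in for the paper's model, which is not publicly released.
model = torch.nn.Linear(768, 768)

# Values quoted from the paper: AdamW with beta1 = 0.9, beta2 = 0.98,
# weight decay = 0.05; linear warmup over 2K steps from 1e-4 to a peak of 1e-3.
base_lr = 1e-4
peak_lr = 1e-3
warmup_steps = 2_000
total_steps = 20_000  # assumed: the paper reports 20 epochs, not total steps

optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.9, 0.98), weight_decay=0.05)

def piecewise_linear(step: int) -> float:
    """Multiplier on peak_lr: ramp linearly from base_lr to peak_lr over
    warmup_steps, then decay linearly to zero at total_steps."""
    if step < warmup_steps:
        lr = base_lr + (step / warmup_steps) * (peak_lr - base_lr)
        return lr / peak_lr
    return max(total_steps - step, 0) / (total_steps - warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=piecewise_linear)

# One dummy optimization step to show the scheduler being advanced per step.
loss = model(torch.randn(4, 768)).pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()
print(scheduler.get_last_lr())  # learning rate after the first warmup step
```

Expressing the schedule as a single `LambdaLR` multiplier keeps both the warmup ramp and the post-warmup decay in one place; whether the paper decays to zero after warmup is not stated, so the decay branch here is an assumption.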