Structure-Aware Multimodal Sequential Learning for Visual Dialog

Authors: Young-Jin Kim, Min-Jun Kim, Kyunghwan An, Jinwoo Ahn, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, Eun-Sol Kim

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For experiments, we achieved a new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.
Researcher Affiliation | Collaboration | 1 Department of Artificial Intelligence Application, Hanyang University, South Korea... 3 KT Corporation
Pseudocode | No | The paper describes the algorithm and architecture but does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We mainly evaluate our algorithm on the most challenging visual dialog dataset, COMET (Kottur et al. 2022), and use two commonly used visual dialog datasets: VisDial v1.0 (VisDial) (Das et al. 2017) and MNIST Dialog (Seo et al. 2017).
Dataset Splits | Yes | VisDial: As presented in Table 5, our model exhibits remarkable results on overall metrics, compared to the baselines, on the VisDial v1.0 validation set.
Hardware Specification | Yes | It takes 5 hours for 20-epoch training with a batch size of 64 on a 4-A100 machine.
Software Dependencies | No | The paper mentions pretrained models such as ViT-base and Flan-T5-base, as well as the AdamW optimizer, but does not provide specific version numbers for software dependencies such as programming languages or libraries.
Experiment Setup | Yes | It takes 5 hours for 20-epoch training with a batch size of 64 on a 4-A100 machine. [...] We use the AdamW optimizer (Loshchilov and Hutter 2017) with β1 = 0.9, β2 = 0.98, and weight decay of 0.05. We use a piecewise linear scheduler with a linear warmup of 2K steps, starting from a learning rate of 1e-4 and a peak learning rate of 1e-3.
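For reference, a minimal PyTorch sketch of the reported optimizer and schedule. The betas, weight decay, warmup length, and learning rates match the quoted setup; the function name, the `total_steps` argument, and the linear decay to zero after warmup are assumptions of this sketch, since the quoted text only specifies the warmup portion of the piecewise linear schedule.

```python
# Hedged sketch of the reported training setup: AdamW (betas 0.9/0.98,
# weight decay 0.05) with a piecewise linear LR schedule that warms up
# from 1e-4 to a peak of 1e-3 over 2K steps. The post-warmup decay target
# (0.0) and `total_steps` are assumptions, not stated in the quoted text.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model, total_steps, warmup_steps=2_000,
                                  start_lr=1e-4, peak_lr=1e-3):
    # Base LR is the peak; the lambda below rescales it per step.
    optimizer = AdamW(model.parameters(), lr=peak_lr,
                      betas=(0.9, 0.98), weight_decay=0.05)

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warmup: start_lr -> peak_lr over `warmup_steps`.
            frac = step / max(1, warmup_steps)
            return (start_lr + frac * (peak_lr - start_lr)) / peak_lr
        # Assumed linear decay from peak_lr down to 0 over the remaining steps.
        remaining = max(1, total_steps - warmup_steps)
        return max(0.0, (total_steps - step) / remaining)

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In such a setup the scheduler would be stepped once per optimizer update, so the 2K warmup steps count gradient steps rather than epochs.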