Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues

Authors: Hung Le, Nancy F. Chen, Steven C.H. Hoi

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results demonstrate the effectiveness of our method and provide additional insights on how models use semantic dependencies in a dialogue context to retrieve visual cues."
Researcher Affiliation | Collaboration | Hung Le, Singapore Management University (hungle.2018@smu.edu.sg); Nancy F. Chen, A*STAR Institute for Infocomm Research (nfychen@i2r.a-star.edu.sg); Steven C.H. Hoi, Salesforce Research Asia (shoi@salesforce.com)
Pseudocode | Yes | "Algorithm 1: Compositional semantic graph of dialogue context"
Open Source Code | No | The paper does not include an unambiguous statement or a direct link to the source code for the methodology described.
Open Datasets | Yes | "We use the Audio-Visual Scene-Aware Dialogue (AVSD) benchmark developed by Alamri et al. (2019)."
Dataset Splits | Yes | #Dialogs: 7,659 (train) / 1,787 (val) / 1,710 (test@DSTC7) / 1,710 (test@DSTC8); #Questions/Answers: 153,180 / 35,740 / 13,490 / 18,810; #Words: 1,450,754 / 339,006 / 110,252 / 162,226
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | Yes | "We first employ a co-reference resolution system, e.g. (Clark & Manning, 2016). We then explore using the Stanford parser system (v3.9.2, retrieved at https://nlp.stanford.edu/software/lex-parser.shtml) to discover sub-nodes. The parser decomposes each sentence into grammatical components, where a word and its modifier are connected in a tree structure. ... word2vec embeddings (https://code.google.com/archive/p/word2vec/) and compute the cosine similarity score. ... We experiment with the Adam optimizer (Kingma & Ba, 2015)." (Illustrative sketches of the parsing and similarity steps follow the table.)
Experiment Setup | Yes | "We experiment with the Adam optimizer (Kingma & Ba, 2015). The models are trained with a warm-up learning rate period of 5 epochs before the learning rate decays, and training runs for up to 50 epochs. The best model is selected by the average loss on the validation set. All model parameters, except the decoder parameters when using pre-trained language models, are initialized with a uniform distribution (Glorot & Bengio, 2010). The Transformer hyper-parameters are fine-tuned by validation results over d = {128, 256}, h = {1, 2, 4, 8, 16}, and a dropout rate from 0.1 to 0.5. Label smoothing (Szegedy et al., 2016) is applied on the labels of $\hat{A}_t$ (label smoothing does not help when optimizing over $\hat{R}_t$, as the labels are limited by the maximum length of dialogues, i.e. 10 in AVSD)." (A training-schedule sketch follows the table.)
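
To make the quoted graph-construction step concrete, here is a minimal sketch of sub-node discovery via dependency parsing, assuming a head-modifier edge is kept for every non-root token. spaCy stands in for the Stanford parser cited in the paper; the function name `modifier_edges` and the example sentence are illustrative, not taken from the paper.

```python
# Hedged sketch of sub-node discovery via dependency parsing.
# spaCy is a stand-in for the Stanford parser (v3.9.2) cited in the paper.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def modifier_edges(sentence: str):
    """Return (head, modifier) pairs from the sentence's dependency tree."""
    doc = nlp(sentence)
    return [(tok.head.text, tok.text) for tok in doc if tok.dep_ != "ROOT"]

# Each pair can seed a node/sub-node edge in the semantic graph,
# e.g. ('shirt', 'red') from the adjectival-modifier relation below.
print(modifier_edges("The man in the red shirt opens the door."))
```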
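The node-similarity step over word2vec embeddings might look like the following gensim-based sketch; the GoogleNews vector file, the `node_similarity` helper, and the 0.5 merge threshold are assumptions for illustration (the excerpt does not state how the score is thresholded).

```python
# Hedged sketch: compare two candidate node labels with word2vec cosine
# similarity, as in the merging step described in the paper's excerpt.
from gensim.models import KeyedVectors

# GoogleNews vectors from the word2vec archive linked in the footnote;
# the local file name is an assumption.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def node_similarity(word_a: str, word_b: str) -> float:
    """Cosine similarity between two node labels; 0.0 if out of vocabulary."""
    if word_a not in vectors or word_b not in vectors:
        return 0.0
    return float(vectors.similarity(word_a, word_b))

# Merge two graph nodes if their labels are semantically close
# (the 0.5 threshold is illustrative, not from the paper).
if node_similarity("sofa", "couch") > 0.5:
    print("merge nodes")
```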
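Finally, the quoted experiment setup could be wired up as in this PyTorch sketch; the exact decay form, base learning rate, smoothing value, and model stub are assumptions not specified in the excerpt.

```python
# Minimal PyTorch sketch of the quoted schedule: Adam, 5-epoch warm-up,
# then decay, training up to 50 epochs, label smoothing on answer labels.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, nhead=8)   # d and h drawn from the tuned grid
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # base lr assumed

WARMUP_EPOCHS, TOTAL_EPOCHS = 5, 50

def lr_factor(epoch: int) -> float:
    # Linear warm-up for the first 5 epochs, then inverse decay
    # (the exact decay form is not given in the excerpt).
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    return WARMUP_EPOCHS / (epoch + 1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# Label smoothing (Szegedy et al., 2016) on the answer labels A_t;
# the 0.1 smoothing value is an assumption.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

for epoch in range(TOTAL_EPOCHS):
    # ... run one training epoch: loss = criterion(logits, answer_tokens)
    scheduler.step()
```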