Open Domain Dialogue Generation with Latent Images
Authors: Ze Yang, Wei Wu, Huang Hu, Can Xu, Wei Wang, Zhoujun Li
AAAI 2021, pp. 14239-14247
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies are conducted in both image-grounded conversation and text-based conversation. |
| Researcher Affiliation | Collaboration | (1) State Key Lab of Software Development Environment, Beihang University, Beijing, China; (2) Meituan, Beijing, China; (3) Microsoft, Beijing, China; (4) China Resources Group, Shenzhen, China |
| Pseudocode | No | The paper describes the models and their components, and includes a model architecture diagram (Figure 2), but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to evaluation scripts and baseline implementations (e.g., 'https://github.com/Maluuba/nlg-eval', 'https://github.com/IBM/pytorch-seq2seq'), but gives no statement or link releasing the source code of the authors' proposed method (IMGVAE). |
| Open Datasets | Yes | For image-grounded dialogue set DI, we choose Image-Chat data published in (Shuster et al. 2020)... For the textual dialogue set DT, we use the Reddit Conversation Corpus published by (Dziri et al. 2018) |
| Dataset Splits | Yes | The training/validation/test sets are split into 186,782/5,000/9,997 respectively... we randomly sample 1M/20K/20K dialogues as the training/validation/test set of the Reddit data. |
| Hardware Specification | Yes | Our model is trained on 4 Tesla 32GB P40 GPUs in a data-parallel manner with batch size 100. |
| Software Dependencies | No | The paper mentions using the Adam algorithm and implies the use of PyTorch through baseline implementations, but does not specify version numbers for any software dependencies used in their own model's implementation. |
| Experiment Setup | Yes | In both tasks, d1, d2, d3, and d4 are set as 512, 48, 768, and 300 respectively. The image reconstructor has 2 attentional visual refiners (i.e. m = 2), and the numbers of image sub-regions N0 and N1 are set as 64 × 64 and 128 × 128 respectively. The dimension of ϵ and the dimension of the augmented conditioning vector are set as 100. ...We learn all models using Adam algorithm (Kingma and Ba 2015) and the learning rates for image reconstructor and response generator are set as 1 × 10^-4 and 1 × 10^-3 respectively. |
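The Experiment Setup row reports concrete hyperparameters. Below is a minimal PyTorch sketch of how that configuration could be wired up; it is not the authors' IMGVAE implementation (no code was released). The module definitions, variable names, and layer choices are hypothetical placeholders, and only the numeric values (the dimensions d1 through d4, the noise dimension, the batch size, and the two Adam learning rates) come from the paper.

```python
# Hedged sketch of the reported training configuration.
# Placeholder modules only; the real IMGVAE architecture is not reproduced here.
import torch
from torch import nn

# Hyperparameters reported in the paper.
D1, D2, D3, D4 = 512, 48, 768, 300    # hidden dimensions d1..d4
NOISE_DIM = 100                       # dimension of epsilon / augmented conditioning vector
SUB_REGIONS = [(64, 64), (128, 128)]  # N0 and N1 for the m = 2 attentional visual refiners
BATCH_SIZE = 100

# Hypothetical stand-ins for the image reconstructor and response generator.
image_reconstructor = nn.Sequential(
    nn.Linear(D1 + NOISE_DIM, D3), nn.ReLU(), nn.Linear(D3, D4)
)
response_generator = nn.Sequential(
    nn.Linear(D3, D1), nn.ReLU(), nn.Linear(D1, D2)
)

# Separate Adam optimizers with the learning rates reported in the paper.
optim_reconstructor = torch.optim.Adam(image_reconstructor.parameters(), lr=1e-4)
optim_generator = torch.optim.Adam(response_generator.parameters(), lr=1e-3)
```

For the reported 4-GPU data-parallel training with batch size 100, these modules would typically be wrapped in torch.nn.DataParallel or DistributedDataParallel before the training loop.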