Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances

Authors: Thao Le Minh, Nobuyuki Shimizu, Takashi Miyazaki, Koichi Shinoda

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments to demonstrate the effectiveness of our proposed model as well as the benefit of our dataset. We compared our proposed model against the two unimodal recognition models for addressee recognition, as shown in Table 3. There were 369,306 utterances and corresponding images used for training; 123,102 for testing and the remaining 123,102 as the validation set for adjusting the classifier.
Researcher Affiliation | Collaboration | 1 Tokyo Institute of Technology, Tokyo, Japan; 2 Yahoo Japan Corporation
Pseudocode | No | The paper includes a 'Network Architecture' diagram (Figure 2) but does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states: 'Our ARVSU dataset will be released at https://research-lab.yahoo.co.jp/en/software/.' This refers only to the dataset, not to open-source code for the described methodology.
Open Datasets | Yes | we created a mock dataset called Addressee Recognition in Visual Scenes with Utterances (ARVSU). Our ARVSU dataset will be released at https://research-lab.yahoo.co.jp/en/software/.
Dataset Splits | Yes | There were 369,306 utterances and corresponding images used for training; 123,102 for testing and the remaining 123,102 as the validation set for adjusting the classifier.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. It only mentions software frameworks like Keras and TensorFlow.
Software Dependencies | No | The paper states: 'The proposed model was implemented using Keras 1 with TensorFlow backend.' While 'Keras 1' is a specific major version, the version of TensorFlow is not provided, and only one component has a version specified.
Experiment Setup | Yes | The learning rate was set to 0.001 and the batch size was set to 64.
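
To make the Experiment Setup, Software Dependencies, and Dataset Splits rows concrete, the following is a minimal Keras (TensorFlow backend) training sketch that uses the reported learning rate of 0.001 and batch size of 64, with the reported split sizes noted in comments (369,306 / 123,102 / 123,102, i.e. an exact 60/20/20 split of 615,510 examples). The two-branch architecture, feature dimensions, optimizer choice (Adam), and number of addressee classes are illustrative assumptions and are not specified in the quoted passages; the paper also used an earlier Keras release than the tf.keras API shown here.

```python
# Hedged sketch of the reported setup: learning rate 0.001, batch size 64,
# Keras with a TensorFlow backend. Architecture, feature sizes, optimizer,
# and class count below are ASSUMPTIONS, not taken from the paper.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 3        # assumed number of addressee classes
IMG_FEAT_DIM = 2048    # assumed pre-extracted image feature size
UTT_FEAT_DIM = 300     # assumed utterance embedding size

# Two modality inputs: an image feature vector and an utterance feature vector.
image_in = layers.Input(shape=(IMG_FEAT_DIM,), name="image_features")
utter_in = layers.Input(shape=(UTT_FEAT_DIM,), name="utterance_features")

# Fuse the modalities by concatenation and classify (illustrative only).
fused = layers.Concatenate()([image_in, utter_in])
hidden = layers.Dense(256, activation="relu")(fused)
output = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = Model(inputs=[image_in, utter_in], outputs=output)

# Reported hyperparameters: learning rate 0.001, batch size 64.
# The choice of Adam is an assumption; the quoted text does not name the optimizer.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Dummy data standing in for the reported split sizes
# (369,306 train / 123,102 validation / 123,102 test); small counts used here.
x_img = np.random.rand(640, IMG_FEAT_DIM).astype("float32")
x_utt = np.random.rand(640, UTT_FEAT_DIM).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=(640,))

model.fit([x_img, x_utt], y, batch_size=64, epochs=1, validation_split=0.2)
```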