Placing Objects in Gesture Space: Toward Incremental Interpretation of Multimodal Spatial Descriptions

Authors: Ting Han, Casey Kennington, David Schlangen

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we model the hearer's task, using a multimodal spatial description corpus we collected. To reduce the variability of verbal descriptions, we simplified the setup to use simple objects as landmarks. We describe a real-time system to evaluate the separate and joint contributions of the modalities. We show that gestures not only help to improve the overall system performance, even if to a large extent they encode redundant information, but also result in earlier final correct interpretations.
Researcher Affiliation | Academia | Ting Han (1), Casey Kennington (2), David Schlangen (1); (1) Dialogue Systems Group // CITEC, Bielefeld University; (2) Boise State University; {ting.han, david.schlangen}@uni-bielefeld.de, caseykennington@boisestate.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The corpus is publicly available (footnote 1: https://pub.uni-bielefeld.de/data/2913177). This refers to the corpus, not the source code for the methodology or system described in the paper.
Open Datasets | Yes | We collected a multimodal spatial description corpus which was elicited with a simplified scene description task (see details in Data collection). The corpus is publicly available (footnote 1: https://pub.uni-bielefeld.de/data/2913177).
Dataset Splits | No | The paper states 'The training was stopped when validation loss stopped decreasing.', which implies a validation set was used, but it does not provide specific details on how this validation split was created (e.g., percentages, counts, or a splitting methodology beyond the hold-one-out procedure used for train/test).
Hardware Specification | No | The paper mentions that 'hand motion was tracked by a Leap sensor' and that 'The classification for each stroke hold takes around 10 to 20 ms, correlated to the computational ability of the machine.' However, it does not provide specific hardware details (e.g., CPU/GPU models, memory) of the machine used to run the experiments.
Software Dependencies | No | The paper mentions software such as 'Keras (Chollet 2015)', the 'Inpro TK toolkit (Baumann and Schlangen 2012)', and 'ELAN, a software for annotation', but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | The LSTM classifier includes two hidden layers and a sigmoid dense layer to give predictions. The first hidden layer has 68 nodes whose outputs are defined by tanh activation functions. The second hidden layer has 38 nodes and outputs via the dense layer. A dropout layer is applied to the second layer to enable more effective learning: 50% of the input units are randomly selected and set to 0 to avoid overfitting. We chose a binary cross-entropy loss function optimised with an rmsprop optimiser. The training was stopped when validation loss stopped decreasing. We fit a Gaussian KDE model (with the bandwidth set to 5). When combining speech with gestures, the average eo is slightly higher.
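
The quoted setup can be approximated in Keras, the library the paper cites. The sketch below is a minimal, hypothetical reconstruction: the layer sizes, dropout rate, loss, optimiser, early-stopping criterion, and KDE bandwidth follow the quoted text, while the input shape, feature dimensionality, variable names, and data loading are assumptions not specified in the paper.

```python
# Minimal sketch of the described classifier; not the authors' code.
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.neighbors import KernelDensity

N_TIMESTEPS = 50   # assumed length of a stroke-hold feature sequence (not given in the paper)
N_FEATURES = 9     # assumed per-frame hand-motion feature dimension (not given in the paper)

model = keras.Sequential([
    keras.Input(shape=(N_TIMESTEPS, N_FEATURES)),
    # First hidden layer: 68 nodes with tanh outputs, feeding the next recurrent layer.
    layers.LSTM(68, activation="tanh", return_sequences=True),
    # Dropout on the second layer's inputs: 50% of units randomly set to 0.
    layers.Dropout(0.5),
    # Second hidden layer: 38 nodes.
    layers.LSTM(38),
    # Sigmoid dense layer giving the prediction.
    layers.Dense(1, activation="sigmoid"),
])

# Binary cross-entropy loss optimised with rmsprop, as quoted.
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

# Training stops when validation loss stops decreasing.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=1,
                                           restore_best_weights=True)
# X_train, y_train, X_val, y_val are placeholders for corpus features and labels.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])

# Gaussian KDE with an absolute bandwidth of 5, as quoted; the pointing
# coordinates it is fit on are assumed here.
kde = KernelDensity(kernel="gaussian", bandwidth=5.0)
# kde.fit(pointing_positions)            # shape (n_samples, 2), hypothetical
# log_density = kde.score_samples(grid)  # evaluate over candidate object locations
```

Note that scikit-learn's KernelDensity is used here only because it accepts an absolute bandwidth matching the quoted value of 5; the paper's quote does not name the KDE implementation the authors actually used.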