Learning Multi-Modal Grounded Linguistic Semantics by Playing “I Spy”
Authors: Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, Raymond J. Mooney
IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results test generalization to new objects not seen during training and illustrate both that the system learns accurate word meanings and that modalities beyond vision improve its performance. We demonstrate that our multi-modal system for grounding natural language outperforms a traditional, vision-only grounding framework by comparing the two on the I Spy task. |
| Researcher Affiliation | Academia | Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond J. Mooney, Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA. {jesse, jsinapov, maxwell, pstone, mooney}@cs.utexas.edu |
| Pseudocode | No | The paper describes the methodology and logic in prose and mathematical formulas (e.g., Section 5.2 Grounded Language Learning) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper provides links to third-party tools used (ROS, Festival) and a video demonstration, but does not include an explicit statement or link for the open-source code of the methodology described in the paper. |
| Open Datasets | Yes | The set of objects used in this experiment consisted of 32 common household items including cups, bottles, cans, and other containers, shown in Figure 2. Some of the objects contained liquids or other contents (e.g., coffee beans) while others were empty. Contemporary work gives a more detailed description of this object dataset [Sinapov et al., 2016]. |
| Dataset Splits | Yes | We divided our 32-object dataset into 4 folds. For each fold, at least 10 human participants played I Spy with both the vision-only and multi-modal systems... For subsequent folds, the systems were incrementally trained using labels from previous folds only, such that the systems were always being tested against novel, unseen objects. Training the predicate classifiers using leave-one-out cross-validation over objects, we calculated the average precision, recall, and F1 scores of each against human predicate labels on the held-out object. (An evaluation-loop sketch follows the table.) |
| Hardware Specification | Yes | The robot used in this study was a Kinova MICO arm mounted on top of a custom-built mobile base which remained stationary during our experiment. The robot's perception included joint effort sensors in each of the robot arm's motors, a microphone mounted on the mobile base, and an Xtion ASUS Pro RGBD camera. |
| Software Dependencies | No | The paper mentions the use of Robot Operating System (ROS), Festival Speech Synthesis System, VGG network, and Point Cloud Library, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Prior to the experiment, the robot explored the objects using the methodology described by Sinapov et al. [2014a]... In our case, the robot used 7 distinct actions: grasp, lift, hold, lower, drop, push, and press... During the execution of each action, the robot recorded the sensory perceptions from haptic (i.e., joint efforts) and auditory sensory modalities... The joint efforts and joint positions were recorded for all 6 joints at 15 Hz... Each context classifier M_c, c ∈ C, was a quadratic-kernel SVM trained with positive and negative labels... (A per-context training sketch follows the table.) |
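The Experiment Setup row describes one quadratic-kernel SVM per sensorimotor context (behavior paired with modality). The sketch below illustrates that training step; only the seven behaviors and the quadratic kernel come from the paper, while the function names, the dictionary layout of features, and the majority-vote combination are illustrative assumptions (the paper weights each context's decision by its estimated reliability rather than voting uniformly).

```python
# Minimal sketch, not the authors' released code.
import numpy as np
from sklearn.svm import SVC

# The seven exploratory behaviors reported in the paper.
BEHAVIORS = ["grasp", "lift", "hold", "lower", "drop", "push", "press"]

def train_context_classifiers(context_features, labels):
    """context_features: dict mapping a (behavior, modality) context to an
    (n_objects, d) feature array; labels: (n_objects,) binary labels for one
    predicate word (e.g., 'heavy')."""
    classifiers = {}
    for context, X in context_features.items():
        clf = SVC(kernel="poly", degree=2)  # quadratic kernel, per the paper
        clf.fit(np.asarray(X), np.asarray(labels))
        classifiers[context] = clf
    return classifiers

def predict_predicate(classifiers, object_features):
    """Combine per-context decisions by simple majority vote (a simplification)."""
    votes = [clf.predict(np.asarray(object_features[ctx]).reshape(1, -1))[0]
             for ctx, clf in classifiers.items()]
    return sum(votes) > len(votes) / 2
```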
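The Dataset Splits row quotes a leave-one-out cross-validation over objects scored by precision, recall, and F1 against human predicate labels. The sketch below shows one way that evaluation loop could look, assuming a single flattened feature vector per object and scikit-learn metrics; the function name and data layout are assumptions, not the paper's code.

```python
# Sketch of a leave-one-object-out evaluation for one predicate classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

def leave_one_object_out_scores(features, labels):
    """features: (n_objects, d) array; labels: (n_objects,) binary human
    labels for one predicate. Returns precision, recall, and F1 computed
    on the held-out objects."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    predictions = np.zeros_like(labels)
    for held_out in range(len(labels)):
        train_idx = [i for i in range(len(labels)) if i != held_out]
        clf = SVC(kernel="poly", degree=2)  # mirrors the quadratic-kernel SVMs above
        clf.fit(features[train_idx], labels[train_idx])
        predictions[held_out] = clf.predict(features[held_out:held_out + 1])[0]
    return precision_recall_fscore_support(
        labels, predictions, average="binary", zero_division=0)[:3]
```

With the 32-object dataset described in the paper, this loop retrains each predicate classifier 32 times, once per held-out object.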