Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context

Authors: Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz, Nicholas Roy

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Factors in the model are trained in a data-driven manner using an aligned vision-language corpus. We demonstrate the approach on a Baxter Research Robot following and executing complex natural language instructions in a manipulation domain using a standardized object data set. ... To evaluate our approach quantitatively we collected a video corpus of humans performing actions while providing declarative facts and commands for the robot to execute.
Researcher Affiliation | Academia | Rohan Paul and Andrei Barbu and Sue Felshin and Boris Katz and Nicholas Roy, Massachusetts Institute of Technology, Cambridge, MA ... Contact: {rohanp, abarbu, sfelshin, boris, nickroy}@csail.mit.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link for 'Video demonstrations and the corpus used for quantitative evaluation' but does not explicitly state that source code for the methodology is provided.
Open Datasets | Yes | Our corpus consists of longer videos composed by combining 96 short, 3 second long, videos consisting of a person performing one action out of 5... with one of eight objects from the YCB data set [Calli et al., 2015]... Video demonstrations and the corpus used for quantitative evaluation are available at: http://toyota.csail.mit.edu/node/28
Dataset Splits | No | The paper describes dataset collection and the total number of video-sentence pairs used for evaluation, but it does not provide specific percentages or counts for training, validation, and testing splits, nor does it explicitly mention a validation set.
Hardware Specification | Yes | The system was deployed on the Baxter Research Robot operating on a tabletop workspace. The robot observed the workspace using images captured using a cross-calibrated Kinect version 2 RGB-D sensor... Spoken commands from the human operator were converted to text using an Amazon Echo Dot.
Software Dependencies | No | The paper mentions software components such as START for parsing and a binary SVM for object recognition, but it does not provide specific version numbers for these or any other software dependencies (a hedged sketch of such a binary object-recognition SVM follows the table).
Experiment Setup | Yes | The imperative grounding factor... was trained using an aligned corpus of language instructions paired with scenes where the robot performs a manipulation task. A data set consisting of 51 language instructions paired with randomized world configurations generating a total of 4160 examples of individual constituent-grounding factors. Ground truth was assigned by hand. A total of 1860 features were used for training. Parameters were trained using a quasi-Newton optimization procedure. The declarative grounding factor... An EM-like algorithm acquired the parameters... using a corpus of 15 short videos, 4 seconds long, of agents performing actions in the workspace. (A minimal sketch of this style of quasi-Newton training appears below.)
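
The object recognizer is described in the paper only as a binary SVM operating on the robot's Kinect observations of YCB objects; no library, features, or versions are named. The following is a minimal, hypothetical sketch of what a one-vs-rest bank of binary SVMs over the eight object classes could look like, assuming scikit-learn and randomly generated stand-in feature vectors; it is an illustration of the general technique, not the authors' implementation.

```python
# Hypothetical sketch: one binary SVM per YCB object class (one-vs-rest),
# since the paper only says "a binary SVM" was used for object recognition.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_crops, n_feats = 200, 128                # placeholder sizes, not from the paper
X = rng.normal(size=(n_crops, n_feats))    # stand-in descriptors for object crops
labels = rng.integers(0, 8, size=n_crops)  # 8 YCB object classes in the corpus

detectors = {}
for cls in range(8):
    y = (labels == cls).astype(int)        # binary target: this class vs. the rest
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    detectors[cls] = clf.fit(X, y)

# Score a new crop with every detector and report the highest-scoring class.
crop = X[:1]
scores = {cls: clf.decision_function(crop)[0] for cls, clf in detectors.items()}
print("predicted class:", max(scores, key=scores.get))
```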
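
Similarly, the Experiment Setup quote states that the imperative grounding factor was fit on 4160 hand-labelled constituent-grounding examples with 1860 features using "a quasi-Newton optimization procedure", without naming the procedure or giving code. The sketch below is a hedged illustration of that kind of fit: an L2-regularized binary log-linear factor optimized with L-BFGS via SciPy, applied to randomly generated stand-in data of the quoted dimensions. L-BFGS is used here purely as a representative quasi-Newton method, and the EM-like training of the declarative factor is not reproduced.

```python
# Hypothetical sketch: quasi-Newton (L-BFGS) fitting of a binary log-linear
# constituent-grounding factor. Real features and labels are not available,
# so random data of the quoted sizes stands in.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n_examples, n_features = 4160, 1860            # sizes quoted in the paper
X = rng.normal(size=(n_examples, n_features))  # stand-in grounding features
y = rng.integers(0, 2, size=n_examples)        # stand-in hand-assigned labels

def objective(w, X, y, l2=1.0):
    """L2-regularized negative log-likelihood of a binary log-linear factor."""
    z = X @ w
    nll = np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * l2 * (w @ w)
    grad = X.T @ (expit(z) - y) + l2 * w
    return nll, grad

result = minimize(objective, np.zeros(n_features), args=(X, y),
                  jac=True, method="L-BFGS-B")
print("converged:", result.success, "final objective:", result.fun)
```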