Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context
Authors: Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz, Nicholas Roy
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Factors in the model are trained in a data-driven manner using an aligned vision-language corpus. We demonstrate the approach on a Baxter Research Robot following and executing complex natural language instructions in a manipulation domain using a standardized object data set. ... To evaluate our approach quantitatively we collected a video corpus of humans performing actions while providing declarative facts and commands for the robot to execute. |
| Researcher Affiliation | Academia | Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz, and Nicholas Roy, Massachusetts Institute of Technology, Cambridge, MA ... Contact: {rohanp, abarbu, sfelshin, boris, nickroy}@csail.mit.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link for 'Video demonstrations and the corpus used for quantitative evaluation' but does not explicitly state that source code for the methodology is provided. |
| Open Datasets | Yes | Our corpus consists of longer videos composed by combining 96 short, 3-second-long videos of a person performing one action out of 5... with one of eight objects from the YCB data set [Calli et al., 2015]... Video demonstrations and the corpus used for quantitative evaluation are available at: http://toyota.csail.mit.edu/node/28 |
| Dataset Splits | No | The paper describes dataset collection and the total number of video-sentence pairs used for evaluation, but it does not provide training/validation/test split counts or percentages, nor does it mention a validation set. |
| Hardware Specification | Yes | The system was deployed on the Baxter Research Robot operating on a tabletop workspace. The robot observed the workspace using images captured using a cross-calibrated Kinect version 2 RGB-D sensor... Spoken commands from the human operator were converted to text using an Amazon Echo Dot. |
| Software Dependencies | No | The paper mentions software components such as START for parsing and a binary SVM for object recognition, but it does not provide version numbers for these or any other software dependencies (an illustrative SVM sketch follows the table). |
| Experiment Setup | Yes | The imperative grounding factor... was trained using an aligned corpus of language instructions paired with scenes where the robot performs a manipulation task. A data set consisting of 51 language instructions paired with randomized world configurations generated a total of 4160 examples of individual constituent-grounding factors. Ground truth was assigned by hand. A total of 1860 features were used for training. Parameters were trained using a quasi-Newton optimization procedure (a training sketch follows the table). The declarative grounding factor... An EM-like algorithm acquired the parameters... using a corpus of 15 short videos, 4 seconds long, of agents performing actions in the workspace. |
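
The Experiment Setup row quotes a quasi-Newton optimization over 1860 hand-assigned constituent-grounding examples but gives no further training details. Below is a minimal sketch, assuming a log-linear (logistic) grounding factor over binary constituent-grounding correspondences and L-BFGS as the quasi-Newton method; the feature matrix `phi`, the L2 weight `l2`, and the helper name `train_grounding_factor` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of quasi-Newton training for a log-linear grounding factor.
# Assumes binary correspondence labels and L-BFGS; the paper's factor
# structure and 1860-feature set are not reproduced here.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def train_grounding_factor(phi, y, l2=1.0):
    """phi: (N, D) feature matrix for N constituent-grounding pairs;
    y: (N,) binary labels (1 = correct grounding). Returns weights (D,)."""
    _, D = phi.shape

    def neg_log_likelihood(w):
        z = phi @ w
        # Numerically stable log(1 + exp(s)): s = -z for positives, +z otherwise.
        s = np.where(y == 1, -z, z)
        nll = np.sum(np.logaddexp(0.0, s)) + 0.5 * l2 * np.dot(w, w)
        grad = phi.T @ (expit(z) - y) + l2 * w
        return nll, grad

    res = minimize(neg_log_likelihood, np.zeros(D), jac=True, method="L-BFGS-B")
    return res.x
```

Under this reading, inference would score candidate groundings via `expit(phi @ w)`; the EM-like acquisition of the declarative factor from the 15-video corpus is not reproduced here.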
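The Software Dependencies row notes that the paper names a binary SVM for object recognition without specifying library, features, or version. As a minimal sketch, assuming a scikit-learn-style pipeline over fixed-size image crops; the flattened-pixel features and the helper name `train_object_recognizer` are placeholders, not the authors' method.

```python
# Minimal sketch of a binary SVM object recognizer. The paper says only
# "a binary SVM for object recognition"; the feature extraction below
# (flattened pixel crops) is a placeholder assumption.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_object_recognizer(crops, labels):
    """crops: image crops pre-resized to a common HxWx3 shape;
    labels: 1 = target object present, 0 = not. Returns a fitted classifier."""
    X = np.stack([c.astype(np.float32).ravel() for c in crops])
    y = np.asarray(labels)
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X, y)
    return clf
```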