Grounding of Human Environments and Activities for Autonomous Robots
Authors: Muhannad Alomari, Paul Duckworth, Nils Bore, Majd Hawasly, David C. Hogg, Anthony G. Cohn
Venue: IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present three experiments to evaluate the system's performance in: 1) unsupervised concept extraction, 2) unsupervised language grounding, and 3) simple sentence generation to describe previously unseen video clips. We use a publicly-available long-term human activity dataset collected over the period of five days by a mobile robot from multiple view points. |
| Researcher Affiliation | Academia | 1University of Leeds, United Kingdom. 2Royal Institute of Technology (KTH), Sweden |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The paper does not include an unambiguous statement or a direct link to a source-code repository for the methodology described in this paper. |
| Open Datasets | Yes | We use a publicly-available long-term human activity dataset collected over the period of five days by a mobile robot from multiple view points [1]. The dataset contains 493 video clips, each containing a single human performing a simple activity in a kitchen area of an office environment (e.g. heating food, preparing hot drinks, using a multi-function printer, throwing trash, washing up, etc.). On top of the dataset, we collected natural language descriptions of each video clip using Amazon Mechanical Turk, where we requested Turkers to describe the activity in the clip and the person's appearance (given a fabricated name). A total of almost 3000 descriptions were collected (6 per clip on average). Example images from a video clip are shown in Fig. 5 along with a subset of the descriptions obtained. [1] Dataset: http://doi.org/10.5518/86 |
| Dataset Splits | Yes | As an upper bound and to provide a reference result, we also show the V-measure results obtained using a supervised (linear) support vector machine classifier (SVM) with 4-fold cross-validation. (This reference baseline is illustrated in the SVM/V-measure sketch after the table.) |
| Hardware Specification | Yes | Our robot has two PCs with i7 processors running ROS indigo, and a single GTX 1050 Ti GPU with 2 GB of memory on which the convolutional pose machine for human pose estimation runs. |
| Software Dependencies | Yes | Our robot has two PCs with i7 processors running ROS indigo |
| Experiment Setup | Yes | After that, we incrementally process new data using Variational Bayes with a regular mini-batch size of 5 videos to allow frequent updating. For this task, we removed a video clip from the training data and passed it to the robot after training on the remaining videos in order to generate a sentence describing the unseen video. We repeated this 10 times. The two templates we use are "person has a colour top and a colour lower garment" and "The person is activity using a(n) object". (The incremental mini-batch updating is illustrated in the Variational Bayes sketch after the table.) |
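The "Dataset Splits" row cites a supervised linear SVM with 4-fold cross-validation, scored with the V-measure, as an upper-bound reference for the unsupervised concept-extraction results. Below is a minimal sketch of that kind of baseline, assuming scikit-learn; it is not the authors' code, and `features` and `labels` are hypothetical stand-ins for the paper's per-detection feature vectors and ground-truth concept labels.

```python
# Hedged sketch of the supervised SVM reference baseline (4-fold CV + V-measure).
# Data below is randomly generated purely for illustration.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))    # hypothetical feature vectors
labels = rng.integers(0, 5, size=200)    # hypothetical ground-truth concept labels

# Predict every sample while it is held out in one of the 4 folds.
predictions = cross_val_predict(LinearSVC(max_iter=5000), features, labels, cv=4)

# V-measure (harmonic mean of homogeneity and completeness) against ground truth,
# making the supervised result comparable to the unsupervised clustering scores.
print("SVM reference V-measure: %.3f" % v_measure_score(labels, predictions))
```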
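The "Experiment Setup" row describes incremental processing with Variational Bayes using regular mini-batches of 5 videos. The sketch below illustrates that style of updating with scikit-learn's online Latent Dirichlet Allocation; it is an assumed stand-in, not the authors' implementation, and the bag-of-concepts count matrix is fabricated for illustration.

```python
# Hedged sketch of incremental Variational Bayes updating with mini-batches of 5.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_videos, n_concepts, batch_size = 50, 30, 5
# Hypothetical bag-of-concepts counts, one row per video clip.
video_counts = rng.integers(0, 4, size=(n_videos, n_concepts))

lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                random_state=0)

# Stream the videos in regular mini-batches of 5 so the variational
# parameters are updated frequently, as described in the quoted setup.
for start in range(0, n_videos, batch_size):
    lda.partial_fit(video_counts[start:start + batch_size])

# Infer the topic mixture of a previously unseen clip (analogous to the
# held-out video used for sentence generation in the third experiment).
unseen_clip = rng.integers(0, 4, size=(1, n_concepts))
print(lda.transform(unseen_clip).round(3))
```

The mini-batch size of 5 matches the quoted setup; the number of topics and concept vocabulary size are arbitrary choices for the sketch.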