Acquiring Common Sense Spatial Knowledge Through Implicit Spatial Templates
Authors: Guillem Collell, Luc Van Gool, Marie-Francine Moens
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present two simple neural-based models that leverage annotated images and structured text to learn this task. The good performance of these models reveals that spatial locations are to a large extent predictable from implicit spatial language. Crucially, the models attain similar performance in a challenging generalized setting, where the object-relation-object combinations (e.g., man walking dog) have never been seen before. Next, we go one step further by presenting the models with unseen objects (e.g., dog). In this scenario, we show that leveraging word embeddings enables the models to output accurate spatial predictions, proving that the models acquire solid common sense spatial knowledge allowing for such generalization. |
| Researcher Affiliation | Academia | Guillem Collell, Department of Computer Science, KU Leuven, gcollell@kuleuven.be; Luc Van Gool, Computer Vision Laboratory, ETH Zurich, vangool@vision.ee.ethz.ch; Marie-Francine Moens, Department of Computer Science, KU Leuven, sien.moens@cs.kuleuven.be |
| Pseudocode | No | The paper describes algorithms but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | This evaluation set along with our Supplementary material are available at https://github.com/gcollell/spatial-commonsense. |
| Open Datasets | Yes | We use the Visual Genome dataset (Krishna et al. 2017) as our source of annotated images. The Visual Genome consists of 108K images containing 1.5M human-annotated (Subject, Relationship, Object) instances with bounding boxes for Subject and Object (Fig. 2). |
| Dataset Splits | Yes | We employ a 10-fold cross-validation (CV) setting. Data are randomly split into 10 disjoint parts and 10% is employed for testing and 90% for training, repeating this for each of the 10 folds. Reported results are averages over the 10 folds. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU model, memory) used for running its experiments. |
| Software Dependencies | Yes | Our experiments are implemented in Python 2.7 and we use the Keras deep learning framework for our models (Chollet and others 2015). |
| Experiment Setup | Yes | Model hyperparameters are first selected in a 10-fold cross-validation setting and we report (averaged) results on 10 new splits. Models are trained for 10 epochs on batches of size 64 with the RMSprop optimizer using a learning rate of 0.0001 and 2 hidden layers with 100 ReLU units. |
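
The "Dataset Splits" row describes a standard 10-fold cross-validation protocol: the data are split into 10 disjoint parts, each fold uses 90% for training and 10% for testing, and results are averaged over the folds. A minimal sketch of that protocol, assuming scikit-learn's `KFold` (the paper does not name the splitting utility it used, and the placeholder feature and target arrays below are illustrative, not the paper's actual data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: X stands in for the model inputs (e.g., word embeddings
# of Subject/Relation/Object), y for the spatial targets. Shapes are assumed.
X = np.random.rand(1000, 300)
y = np.random.rand(1000, 2)

fold_scores = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)  # 10 disjoint parts
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]  # 90% for training
    y_train, y_test = y[train_idx], y[test_idx]  # 10% for testing
    # ... train a model on (X_train, y_train), evaluate on (X_test, y_test),
    # and append the fold score:
    # fold_scores.append(score)

# Reported results are averages over the 10 folds:
# mean_score = np.mean(fold_scores)
```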
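The "Experiment Setup" row gives the training hyperparameters: 10 epochs, batch size 64, RMSprop with a learning rate of 0.0001, and 2 hidden layers of 100 ReLU units. A minimal Keras sketch of a network with exactly those hyperparameters is below; the input/output dimensions and the mean-squared-error loss are assumptions (the excerpt does not specify them), and the code targets the Keras 2-era API the paper would have used (newer versions spell the learning-rate argument `learning_rate`):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

def build_model(input_dim=900, output_dim=2):
    # Two hidden layers with 100 ReLU units each, as reported.
    model = Sequential()
    model.add(Dense(100, activation="relu", input_dim=input_dim))
    model.add(Dense(100, activation="relu"))
    model.add(Dense(output_dim))  # output size is a placeholder, not from the paper
    # RMSprop with learning rate 0.0001, as reported; MSE loss is an assumption.
    model.compile(optimizer=RMSprop(lr=1e-4), loss="mse")
    return model

# Placeholder training data with the assumed dimensions.
X_train = np.random.rand(512, 900)
y_train = np.random.rand(512, 2)

model = build_model()
# Trained for 10 epochs on batches of size 64, as reported.
model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=0)
```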