Extracting Visual Knowledge from the Web with Multimodal Learning

Authors: Dihong Gong, Daisy Zhe Wang

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Experimental results based on 46 object categories show that the extraction precision is improved significantly from 73% (with state-of-the-art deep learning programs) to 81%, which is equivalent to a 31% reduction in error rates. (see the error-rate check below)

Researcher Affiliation | Academia | Dihong Gong, Daisy Zhe Wang, Department of Computer and Information Science and Engineering, University of Florida, {gongd, daisyw}@ufl.edu

Pseudocode | No | The paper describes algorithms but does not include any clearly labeled pseudocode or algorithm blocks.

Open Source Code | No | In this paper, we have applied the hierarchical softmax whose implementation is based on Google word2vec [1]. [1] https://code.google.com/archive/p/word2vec (see the word2vec sketch below)

Open Datasets | Yes | We evaluate our approach based on a collection of web pages and images derived from the Common Crawl dataset [Smith et al., 2013] that is publicly available on Amazon S3. (see the Common Crawl sketch below)

Dataset Splits | No | The paper mentions training data for visual object detectors and evaluation sample sizes, but does not specify explicit train/validation/test splits for the main dataset or experiments.

Hardware Specification | Yes | The Caffe Net models with feature dimension of 4096 were trained on a NVIDIA Tesla K40c GPU. (see the feature-extraction sketch below)

Software Dependencies | No | Parse the HTML webpages, with a C++ open-source program Gumbo-Parser by Google. (see the HTML-parsing sketch below)

Experiment Setup | Yes | For multimodal embedding, we set the dimension of vector representations as 500 (we found that dimensions between 100 and 1000 give similar results) according to the recommendation from [Frome et al., 2013]. For structure learning, we tune the λ parameter in Equation (7) on training data such that the number of non-zero elements is around 100 for the θ parameter. (see the sparsity-tuning sketch below)
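
A quick check of the headline numbers in the Research Type row: a precision gain from 73% to 81% means the error rate falls from 27% to 19%, a relative reduction of roughly 30%, consistent with the reported 31% up to rounding of the underlying precision values.

    \frac{(1 - 0.73) - (1 - 0.81)}{1 - 0.73} = \frac{0.08}{0.27} \approx 0.30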
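
The Open Source Code row notes that the multimodal embedding uses hierarchical softmax as implemented in Google's word2vec C tool, and the Experiment Setup row fixes the embedding dimension at 500. The authors' training code is not released, so the following is only a minimal sketch of an analogous configuration in gensim (a stand-in library, not the paper's implementation), with hierarchical softmax enabled and negative sampling disabled; the toy corpus is purely illustrative.

    # Sketch: 500-dimensional embeddings with hierarchical softmax, using gensim
    # as a stand-in for the Google word2vec C tool cited in the paper.
    from gensim.models import Word2Vec

    # Toy corpus; the paper trains on text derived from crawled web pages.
    sentences = [
        ["dog", "running", "on", "grass"],
        ["cat", "sitting", "on", "sofa"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=500,  # embedding dimension reported in the Experiment Setup row
        hs=1,             # hierarchical softmax, as in the quoted passage
        negative=0,       # disable negative sampling when hs is used
        min_count=1,
        window=5,
    )

    print(model.wv["dog"].shape)  # -> (500,)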
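
The Open Datasets row points to the Common Crawl corpus hosted on Amazon S3. The paper does not publish its extraction code, so the sketch below is only a hypothetical illustration of reading one locally downloaded WARC archive with the third-party warcio library; the archive file name is a placeholder.

    # Sketch: iterate over HTTP response records in a Common Crawl WARC archive.
    # Requires the third-party warcio package; the archive path is a placeholder.
    from warcio.archiveiterator import ArchiveIterator

    def iter_pages(warc_path):
        """Yield (url, raw_bytes) for each HTTP response record in the archive."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                yield url, body

    for url, body in iter_pages("example-commoncrawl-segment.warc.gz"):
        print(url, len(body))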
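
The Hardware Specification row mentions CaffeNet models with 4096-dimensional features trained on a Tesla K40c. Below is a minimal pycaffe sketch for pulling such features from the fc7 layer; the deploy prototxt, weight file, and image path are placeholders, and the preprocessing is deliberately simplified (a full pipeline would also subtract the training mean and swap channels to BGR).

    # Sketch: extract a 4096-D fc7 feature vector with pycaffe.
    # Model, weight, and image paths are placeholders; preprocessing is simplified.
    import numpy as np
    import caffe

    caffe.set_mode_gpu()  # the paper reports training on an NVIDIA Tesla K40c

    net = caffe.Net("deploy.prototxt", "caffenet.caffemodel", caffe.TEST)

    img = caffe.io.load_image("example.jpg")          # HxWx3 float array in [0, 1]
    img = caffe.io.resize_image(img, (227, 227))      # CaffeNet input resolution
    blob = img.transpose(2, 0, 1)[np.newaxis, ...]    # -> shape (1, 3, 227, 227)

    net.blobs["data"].reshape(*blob.shape)
    net.blobs["data"].data[...] = blob
    net.forward()

    features = net.blobs["fc7"].data[0].copy()        # 4096-D feature vector
    print(features.shape)                             # -> (4096,)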
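
The Software Dependencies row quotes the paper's use of Google's Gumbo-Parser, a C++ HTML parsing library. To keep the examples in a single language, the sketch below performs the same basic step, collecting image sources and the surrounding text, with Python's standard-library html.parser; it is a stand-in, not the authors' pipeline.

    # Sketch: collect <img> sources and visible text from a web page using the
    # standard-library HTMLParser, as a stand-in for the C++ Gumbo-Parser
    # referenced in the paper.
    from html.parser import HTMLParser

    class ImageTextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.images = []   # src attributes of <img> tags
            self.texts = []    # text fragments surrounding them

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                src = dict(attrs).get("src")
                if src:
                    self.images.append(src)

        def handle_data(self, data):
            data = data.strip()
            if data:
                self.texts.append(data)

    parser = ImageTextExtractor()
    parser.feed("<html><body><p>A dog running</p><img src='dog.jpg'></body></html>")
    print(parser.images, parser.texts)  # ['dog.jpg'] ['A dog running']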
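
The Experiment Setup row states that λ in Equation (7) is tuned on training data so that θ ends up with roughly 100 non-zero elements. The paper's structure-learning objective is not reproduced here; the sketch below only illustrates the generic procedure of sweeping an L1 penalty until the desired sparsity is reached, using scikit-learn's Lasso on synthetic data as a stand-in.

    # Sketch: sweep an L1 penalty until roughly 100 coefficients remain non-zero.
    # Lasso on synthetic data is a stand-in for the paper's Equation (7) objective.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 1000))
    y = X[:, :100] @ rng.standard_normal(100) + 0.1 * rng.standard_normal(500)

    target_nonzeros = 100
    for lam in np.logspace(-3, 1, 30):          # candidate penalty strengths
        model = Lasso(alpha=lam, max_iter=10000).fit(X, y)
        nonzeros = np.count_nonzero(model.coef_)
        if nonzeros <= target_nonzeros:         # first λ giving <= 100 non-zeros
            print(f"lambda={lam:.4f}, non-zero coefficients={nonzeros}")
            break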