Extracting Visual Knowledge from the Web with Multimodal Learning
Authors: Dihong Gong, Daisy Zhe Wang
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results based on 46 object categories show that the extraction precision is improved significantly from 73% (with state-of-the-art deep learning programs) to 81%, which is equivalent to a 31% reduction in error rates. |
| Researcher Affiliation | Academia | Dihong Gong, Daisy Zhe Wang; Department of Computer and Information Science and Engineering, University of Florida; {gongd, daisyw}@ufl.edu |
| Pseudocode | No | The paper describes algorithms but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | In this paper, we have applied the hierarchical softmax whose implementation is based on Google word2vec (https://code.google.com/archive/p/word2vec). (An embedding-training sketch appears after the table.) |
| Open Datasets | Yes | We evaluate our approach based on a collection of web pages and images derived from the Common Crawl dataset [Smith et al., 2013] that is publicly available on Amazon S3. (An access sketch appears after the table.) |
| Dataset Splits | No | The paper mentions training data for visual object detectors and evaluation sample sizes, but does not specify explicit train/validation/test splits for the main dataset or experiments. |
| Hardware Specification | Yes | The Caffe Net models with feature dimension of 4096 were trained on an NVIDIA Tesla K40c GPU. |
| Software Dependencies | No | Parse the HTML webpages with the C++ open-source program Gumbo-Parser by Google. (A parsing sketch appears after the table.) |
| Experiment Setup | Yes | For multimodal embedding, we set the dimension of vector representations as 500 (we found that dimensions between 100 and 1000 give similar results) according to the recommendation from [Frome et al., 2013]. For structure learning, we tune the λ parameter in Equation (7) on training data such that the number of non-zero elements is around 100 for the θ parameter. (A λ-tuning sketch appears after the table.) |
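
The Open Source Code row points to Google's word2vec rather than author-released code. As a stand-in, the following is a minimal sketch of training 500-dimensional embeddings with hierarchical softmax using the gensim reimplementation of word2vec; gensim, the toy corpus, and all parameters other than the 500-dimensional setting and the hierarchical-softmax choice are assumptions, not the authors' pipeline.

```python
from gensim.models import Word2Vec

# Placeholder corpus: in the paper, tokens would come from web-page text
# surrounding extracted images, not from this toy example.
sentences = [
    ["dog", "running", "on", "grass"],
    ["cat", "sitting", "on", "sofa"],
]

# hs=1 with negative=0 selects pure hierarchical softmax, matching the
# word2vec option the paper cites; vector_size=500 follows the reported
# setup. window and min_count are assumed values for this toy corpus.
model = Word2Vec(
    sentences,
    vector_size=500,
    hs=1,
    negative=0,
    window=5,
    min_count=1,
)

print(model.wv["dog"].shape)  # (500,)
```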
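
The Open Datasets row cites the Common Crawl corpus on Amazon S3. The sketch below shows anonymous access to the public commoncrawl bucket with boto3; the crawl-data/ prefix is the bucket's standard layout, but the exact 2013 segments used in the paper are not identified here and would need to be located through the Common Crawl index.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client: the commoncrawl bucket is public, so no credentials
# are needed when requests are unsigned.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under the crawl-data/ prefix. The specific
# segments the paper used are not reproduced in this record.
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```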
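
The Software Dependencies row names Google's C++ Gumbo-Parser for HTML parsing. The stand-in below uses Python's standard-library html.parser instead (a deliberate substitution, not the authors' tooling) to illustrate the kind of image and alt-text extraction a crawl pipeline performs on each page.

```python
from html.parser import HTMLParser

class ImageExtractor(HTMLParser):
    """Collect <img> src and alt attributes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt", "")))

# Toy page standing in for a crawled document.
extractor = ImageExtractor()
extractor.feed('<html><body><img src="cat.jpg" alt="a cat"></body></html>')
print(extractor.images)  # [('cat.jpg', 'a cat')]
```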
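
The Experiment Setup row describes tuning λ until θ has roughly 100 non-zero elements. Equation (7) is not reproduced in this record, so the sketch below substitutes a generic L1-regularized least-squares objective (scikit-learn's Lasso) on synthetic data purely to show the sparsity-targeted tuning loop; the data, objective, and search grid are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in problem: 500 candidate features, ~100 of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 500))
theta_true = np.zeros(500)
theta_true[:100] = rng.normal(size=100)
y = X @ theta_true + 0.1 * rng.normal(size=400)

# Sweep the regularization strength (sklearn calls it alpha) from weak to
# strong and stop once roughly 100 coefficients remain non-zero, mirroring
# the "around 100 non-zero elements" criterion reported for theta.
for lam in np.logspace(-3, 1, 30):
    nnz = np.count_nonzero(Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_)
    if nnz <= 100:
        print(f"lambda={lam:.4f} -> {nnz} non-zero elements in theta")
        break
```

Sweeping from weak to strong regularization and stopping at the first λ that meets the sparsity target keeps the loop independent of the paper's specific objective.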