Sherlock: Scalable Fact Learning in Images

Authors: Mohamed Elhoseiny, Scott Cohen, Walter Chang, Brian Price, Ahmed Elgammal

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied the investigated methods on several datasets that we augmented with structured facts and a large scale dataset of > 202,000 facts and 814,000 images. Our results show the advantage of relating facts by the structure by the proposed model compared to the baselines.
Researcher Affiliation | Collaboration | Mohamed Elhoseiny (1,2), Scott Cohen (1), Walter Chang (1), Brian Price (1), Ahmed Elgammal (2); 1: Adobe Research, 2: Rutgers University, Computer Science Department
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured, code-like procedural steps.
Open Source Code | No | The paper does not contain an explicit statement or link indicating that its source code is publicly available.
Open Datasets | Yes | We began our data collection by augmenting existing datasets with fact language view labels f_l: PPMI (Yao and Fei-Fei 2010), Stanford40 (Yao et al. 2011), Pascal Actions (Everingham et al.), Sports (Gupta 2009), Visual Phrases (Sadeghi and Farhadi 2011), INTERACT (Antol, Zitnick, and Parikh 2014b) datasets.
Dataset Splits | Yes | In this dataset, we randomly split all the annotations into an 80%-20% split, constructing sets of 647,746 (f_v, f_l) training pairs (with 171,269 unique fact language views f_l) and 168,691 (f_v, f_l) testing pairs (with 58,417 unique f_l), for a total of 816,436 (f_v, f_l) pairs, 202,946 unique f_l. (An illustrative split sketch follows the table.)
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions software like the "GloVe 840B model" and a "Theano implementation" but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For the visual encoder, the shared layers θ^0_c match the architecture of the convolutional and pooling layers of VGG-16 named conv1_1 through pool3, and have seven convolution layers. The subject layers θ^S_c and predicate-object layers θ^{PO}_c are two branches of convolution and pooling layers with the same architecture as the VGG-16 layers named conv4_1 through pool5, which makes six convolution-pooling layers in each branch. Finally, θ^S_u and θ^{PO}_u are two instances of the fc6 and fc7 layers of the VGG-16 network. W^S, W^P, and W^O are initialized randomly and the rest are initialized from VGG-16 trained on ImageNet. (A hedged sketch of this two-branch encoder follows the table.)
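To make the Dataset Splits row concrete, here is a minimal sketch of an 80%-20% random split over (f_v, f_l) annotation pairs. It is an illustration under my own assumptions, not the authors' released code; the split_pairs helper, the random seed, and the example annotations are hypothetical.

```python
# Minimal sketch (not the authors' code): an 80%-20% random split of
# (visual view, language view) annotation pairs, as described in the
# Dataset Splits row. Pair contents and file names are illustrative.
import random

def split_pairs(pairs, train_frac=0.8, seed=0):
    """Randomly split a list of (f_v, f_l) pairs into train and test sets."""
    rng = random.Random(seed)
    shuffled = pairs[:]              # copy so the caller's list stays intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Example with hypothetical annotations: each pair is (image id, fact string).
annotations = [("img_001.jpg", "<person, riding, horse>"),
               ("img_002.jpg", "<dog>"),
               ("img_003.jpg", "<person, playing, guitar>")]
train, test = split_pairs(annotations)
print(len(train), "training pairs,", len(test), "testing pairs")
```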
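The Experiment Setup row describes a VGG-16-based visual encoder with a shared trunk (conv1_1 through pool3), separate subject and predicate-object branches (conv4_1 through pool5 plus their own fc6/fc7 instances), and randomly initialized projections W^S, W^P, W^O. The sketch below is one possible reading of that description in PyTorch, not the paper's Theano implementation; the embedding dimension, the torchvision weights argument, and the mapping of branches to projections are assumptions on my part.

```python
# Hedged sketch (my reading of the Experiment Setup row, not released code):
# a two-branch VGG-16 visual encoder in PyTorch (torchvision >= 0.13 assumed
# for the string `weights` argument).
import copy
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SherlockStyleEncoder(nn.Module):
    def __init__(self, embed_dim=300):                  # 300 matches GloVe 840B, but is an assumption
        super().__init__()
        vgg = vgg16(weights="IMAGENET1K_V1")            # ImageNet-pretrained VGG-16
        feats = vgg.features
        # theta^0_c: shared conv layers, VGG-16 conv1_1 .. pool3
        self.shared = feats[:17]
        # theta^S_c and theta^{PO}_c: two branches, VGG-16 conv4_1 .. pool5,
        # each with its own copy of the pretrained weights
        self.subj_conv = copy.deepcopy(feats[17:31])
        self.po_conv = copy.deepcopy(feats[17:31])
        # theta^S_u and theta^{PO}_u: two instances of VGG-16 fc6/fc7
        fc67 = vgg.classifier[:6]                       # fc6 -> ReLU -> Dropout -> fc7 -> ReLU -> Dropout
        self.subj_fc = copy.deepcopy(fc67)
        self.po_fc = copy.deepcopy(fc67)
        # W^S, W^P, W^O: randomly initialized projections to the embedding space
        self.W_S = nn.Linear(4096, embed_dim)
        self.W_P = nn.Linear(4096, embed_dim)
        self.W_O = nn.Linear(4096, embed_dim)

    def forward(self, images):
        h = self.shared(images)
        s = self.subj_fc(torch.flatten(self.subj_conv(h), 1))
        po = self.po_fc(torch.flatten(self.po_conv(h), 1))
        # Assumption: subject embedding from the subject branch; predicate and
        # object embeddings from the predicate-object branch.
        return self.W_S(s), self.W_P(po), self.W_O(po)

# Usage with a dummy batch of two 224x224 RGB images:
enc = SherlockStyleEncoder()
e_s, e_p, e_o = enc(torch.randn(2, 3, 224, 224))
```

In torchvision's VGG-16, index 17 of the features module is the boundary between pool3 and conv4_1, which is why the trunk and the branches are sliced there; deepcopy gives each branch independent pretrained parameters rather than shared ones.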