Vision-Language Fusion for Object Recognition

Authors: Sz-Rung Shiang, Stephanie Rosenthal, Anatole Gershman, Jaime Carbonell, Jean Oh

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we achieve up to 9.4% and 16.6% accuracy improvements using the oracle and the detected bounding boxes, respectively, over the vision-only recognizers.
Researcher Affiliation | Academia | Sz-Rung Shiang, Stephanie Rosenthal, Anatole Gershman, Jaime Carbonell, Jean Oh; School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania 15213; sshiang@andrew.cmu.edu, {srosenth, anatoleg, jgc, jeanoh}@cs.cmu.edu
Pseudocode | No | The paper describes the MultiRank algorithm descriptively and with equations, but does not include a structured pseudocode block or a figure explicitly labeled 'Algorithm'.
Open Source Code | No | The paper does not provide a repository link or any explicit statement about releasing its source code.
Open Datasets | Yes | We validate our algorithm on the NYU Depth V2 datasets (Silberman et al. 2012).
Dataset Splits | Yes | Using 5-fold cross validation, this vision-only model achieves an accuracy of 0.6299 and mAP of 0.7240 in the ground-truth bounding box case, and an accuracy of 0.4229 and mAP of 0.2820 in the detected bounding box case; 10 additional images were used for validation to tune the parameter α in Equation (3).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions software such as Caffe, AlexNet, and an SVM classifier, along with their respective citations, but it does not provide version numbers for these dependencies.
Experiment Setup | Yes | MultiRank includes two parameters, α and β. Parameter α represents the informativeness of contextual information in the re-ranking process... The parameter β similarly takes the confidence score of each boxgraph into account... These parameters were tuned empirically. Figure 5 shows that accuracy is maximized when the CV output and the contextual information are fused at roughly a 6:4 ratio when 10 relations are used.
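The Dataset Splits row refers to 5-fold cross validation plus a small held-out set of 10 images used to tune α in Equation (3). The sketch below is a minimal illustration of that protocol, assuming a scikit-learn KFold split over the 1449 labeled NYU Depth V2 images; the variable names and the library choice are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import KFold

# Minimal sketch (not the authors' code) of the evaluation protocol:
# 5-fold cross validation over NYU Depth V2 images, with 10 images held
# out to tune the parameter alpha in Equation (3).
image_ids = np.arange(1449)        # NYU Depth V2 provides 1449 labeled images
rng = np.random.default_rng(0)
rng.shuffle(image_ids)

val_ids = image_ids[:10]           # held-out images for tuning alpha
cv_ids = image_ids[10:]            # remaining images for 5-fold CV

for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(cv_ids)):
    train_ids, test_ids = cv_ids[train_idx], cv_ids[test_idx]
    # train the vision-only recognizer and the fusion model on train_ids,
    # then report accuracy / mAP on test_ids
    print(f"fold {fold}: {len(train_ids)} train / {len(test_ids)} test images")
```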
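The Experiment Setup row describes α as the weight that trades off the vision recognizer's output against contextual information, with accuracy peaking at roughly a 6:4 fusion ratio. Equation (3) itself is not reproduced on this page, so the following is only an illustrative sketch of an α-weighted linear re-ranking under that reading; the function name fuse_scores and the toy scores are assumptions.

```python
import numpy as np

def fuse_scores(vision_scores, context_scores, alpha=0.6):
    """Blend per-label vision confidences with contextual scores.

    alpha ~ 0.6 mirrors the roughly 6:4 vision-to-context ratio reported
    to maximize accuracy when 10 relations are used; the exact form of
    Equation (3) in the paper may differ from this linear blend.
    """
    vision_scores = np.asarray(vision_scores, dtype=float)
    context_scores = np.asarray(context_scores, dtype=float)
    return alpha * vision_scores + (1.0 - alpha) * context_scores

# Toy example: re-rank candidate labels for a single bounding box.
labels = ["chair", "table", "sofa"]
vision = [0.50, 0.30, 0.20]    # vision-only recognizer confidences
context = [0.20, 0.60, 0.20]   # scores from textual co-occurrence relations
fused = fuse_scores(vision, context, alpha=0.6)
print(labels[int(np.argmax(fused))])   # prints "table": context re-ranks the box
```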