Vision-Language Fusion for Object Recognition
Authors: Sz-Rung Shiang, Stephanie Rosenthal, Anatole Gershman, Jaime Carbonell, Jean Oh
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we achieve up to 9.4% and 16.6% accuracy improvements using the oracle and the detected bounding boxes, respectively, over the vision-only recognizers. |
| Researcher Affiliation | Academia | Sz-Rung Shiang, Stephanie Rosenthal, Anatole Gershman, Jaime Carbonell, Jean Oh School of Computer Science, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, Pennsylvania, 15213 sshiang@andrew.cmu.edu, {srosenth, anatoleg, jgc, jeanoh}@cs.cmu.edu |
| Pseudocode | No | The paper describes the MultiRank algorithm descriptively and with equations, but does not include a structured pseudocode block or a figure explicitly labeled 'Algorithm'. |
| Open Source Code | No | The paper does not provide any specific repository link or explicit statement about the release of its source code. |
| Open Datasets | Yes | We validate our algorithm on the NYU Depth V2 datasets (Silberman et al. 2012). |
| Dataset Splits | Yes | Using 5-fold cross validation, this vision-only model achieves an accuracy of 0.6299 and mAP 0.7240 in the ground-truth bounding box case, and accuracy 0.4229 and mAP 0.2820 in the detected bounding box case. Additionally, 10 images were used for validation to tune the parameter α in Equation (3). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as Caffe, AlexNet, and an SVM classifier, along with their respective citations, but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | MultiRank includes two parameters: α and β. Parameter α represents the informativeness of contextual information in the re-ranking process... The parameter β similarly takes the confidence score of each boxgraph into account... These parameters were tuned empirically. Figure 5 shows that the accuracy is maximized when the CV output and the contextual information are fused at around a 6 : 4 ratio when 10 relations are used. |
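The reported 6 : 4 fusion ratio corresponds to a convex combination of the vision score and the contextual score with α = 0.6. A minimal sketch of that re-ranking step is below; the function and variable names are illustrative assumptions, not taken from the paper, and the paper's full MultiRank procedure (including β over boxgraphs) is not reproduced here.

```python
def fuse_scores(cv_scores, context_scores, alpha=0.6):
    """Linearly fuse per-class scores for one bounding box.

    alpha weights the vision-only (CV) output; (1 - alpha) weights the
    contextual information. alpha = 0.6 matches the paper's 6 : 4 ratio.
    """
    return [alpha * cv + (1.0 - alpha) * ctx
            for cv, ctx in zip(cv_scores, context_scores)]

# Hypothetical example: context flips the top-ranked class.
cv = [0.5, 0.3, 0.2]    # vision-only class scores (class 0 ranked first)
ctx = [0.1, 0.6, 0.3]   # contextual scores favoring class 1
fused = fuse_scores(cv, ctx, alpha=0.6)
best = max(range(len(fused)), key=fused.__getitem__)  # -> class 1
```

The tuning described in the table amounts to sweeping α on the 10 held-out validation images and keeping the value that maximizes accuracy.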