Diverse Beam Search for Improved Description of Complex Scenes
Authors: Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra
AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method improves both diversity and quality of decoded sequences over existing techniques on two visually-grounded language generation tasks, image captioning and visual question generation, particularly on complex scenes containing diverse visual content. We also show similar improvements at language-only machine translation tasks, highlighting the generality of our approach. and, from Section 5 (Experiments): In this section, we evaluate our approach on image captioning, visual question generation and machine translation tasks to demonstrate both its effectiveness against baselines and its general applicability to any inference currently supported by beam search. Further, we explore the role of diversity in generating language from complex images. |
| Researcher Affiliation | Collaboration | Ashwin K. Vijayakumar¹, Michael Cogswell¹, Ramprasaath R. Selvaraju¹, Qing Sun², Stefan Lee¹, David Crandall³, Dhruv Batra¹,⁴ (¹Georgia Tech, ²Virginia Tech, ³Indiana University, ⁴Facebook AI Research) |
| Pseudocode | Yes | Algorithm 1: Diverse Beam Search (a code sketch follows the table below) |
| Open Source Code | Yes | To aid transparency and reproducibility, our code for DBS is available at https://github.com/ashwinkalyan/dbs. |
| Open Datasets | Yes | We begin by validating our approach on the COCO (Lin et al. 2014) image captioning task and PASCAL-50S (Vedantam, Lawrence Zitnick, and Parikh 2015) dataset and We use the English-French parallel data from the europarl corpus as the training set. |
| Dataset Splits | Yes | We use the public splits as in (Karpathy and Fei-Fei 2015) and train a captioning model (Vinyals et al. 2015) using the neuraltalk2 codebase. We compare decoding methods on this model. and We keep 200 random images as a validation set for tuning and evaluate on the remaining images. and We report results on news-test-2013 and news-test-2014 and use the newstest2012 to tune DBS parameters. |
| Hardware Specification | No | The paper describes the models used (CNN, RNNs, LSTMs) and the experimental setup, but it does not specify any particular hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | We use the public splits as in (Karpathy and Fei-Fei 2015) and train a captioning model (Vinyals et al. 2015) using the neuraltalk2 codebase. We compare decoding methods on this model. and We train an encoder-decoder architecture as proposed in (Bahdanau, Cho, and Bengio 2014) using the dl4mt-tutorial2 code repository. |
| Experiment Setup | Yes | We set all hyperparameters for DBS and the baseline methods by maximizing oracle SPICE via grid search on a held-out validation set for each experiment. and We set λ via grid search over a range of values to maximize oracle accuracies achieved on the validation set. We find a wide range of λ values (0.2 to 0.8) work well for most tasks and datasets with which we experimented. (a sketch of this tuning loop follows the table) |
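
The Algorithm 1 referenced in the Pseudocode row is Diverse Beam Search: the beam budget B is split into G groups that are advanced one after another at each decoding step, and each group's token log-probabilities are penalized for reusing tokens already chosen by earlier groups at that step (Hamming diversity). The following is a minimal NumPy sketch, not the authors' released implementation; `logprob_fn(prefix)` is an assumed stand-in for the trained decoder, and end-of-sequence handling is omitted for brevity.

```python
import numpy as np

def diverse_beam_search(logprob_fn, vocab_size, beam_width=6,
                        num_groups=3, lam=0.5, max_len=20):
    """Sketch of Diverse Beam Search with Hamming diversity.

    logprob_fn(prefix) -> length-vocab_size array of log p(token | prefix);
    a hypothetical stand-in for the trained decoder.
    """
    g_size = beam_width // num_groups  # beams per group
    # each group starts from the empty prefix with score 0
    groups = [[((), 0.0)] for _ in range(num_groups)]

    for t in range(max_len):
        # tokens selected by earlier groups at this time step
        token_counts = np.zeros(vocab_size)
        for g in range(num_groups):
            candidates = []
            for prefix, score in groups[g]:
                logprobs = np.asarray(logprob_fn(prefix))
                # Hamming diversity: subtract lam for every earlier-group
                # beam that already chose a given token at step t
                augmented = logprobs - lam * token_counts
                for w in np.argsort(augmented)[-g_size:]:
                    candidates.append((prefix + (int(w),),
                                       score + float(augmented[w])))
            # standard beam-search selection within the group, on the
            # diversity-augmented objective
            groups[g] = sorted(candidates, key=lambda c: c[1],
                               reverse=True)[:g_size]
            for prefix, _ in groups[g]:
                token_counts[prefix[-1]] += 1
    # all B sequences, best (augmented) score first
    return sorted((b for grp in groups for b in grp),
                  key=lambda c: c[1], reverse=True)
```

With G = 1 (all beams in one group) this reduces to standard beam search, and larger λ trades per-sequence likelihood for cross-group diversity; per the Experiment Setup row, the paper finds λ between 0.2 and 0.8 works well across tasks.

The Experiment Setup row describes tuning λ by maximizing oracle SPICE on a held-out validation set. Below is a minimal sketch of that tuning loop under stated assumptions: `decode_fn` and `score_fn` are hypothetical callables standing in for the DBS decoder and the SPICE scorer, and neither name comes from the paper's code release.

```python
def tune_lambda(val_images, decode_fn, score_fn,
                lambdas=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Grid search over the diversity strength lambda.

    decode_fn(image, lam) -> list of B candidate captions (e.g., via
    diverse_beam_search above); score_fn(image, caption) -> SPICE score.
    Both are assumed stand-ins, not names from the paper's release.
    """
    best_lam, best_oracle = None, float("-inf")
    for lam in lambdas:
        # oracle metric: per image, keep only the best-scoring candidate
        oracle = sum(max(score_fn(img, c) for c in decode_fn(img, lam))
                     for img in val_images) / len(val_images)
        if oracle > best_oracle:
            best_lam, best_oracle = lam, oracle
    return best_lam
```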