Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
Authors: Jaemin Cho, Yushi Hu, Jason Michael Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, Jordi Pont-Tuset, Su Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. |
| Researcher Affiliation | Collaboration | Jaemin Cho¹, Yushi Hu², Roopal Garg³, Peter Anderson³, Ranjay Krishna², Jason Baldridge³, Mohit Bansal¹, Jordi Pont-Tuset³, Su Wang³; ¹University of North Carolina at Chapel Hill, ²University of Washington, ³Google Research |
| Pseudocode | Yes | The paper includes Algorithm 1, captioned "Python pseudocode of T2I evaluation with DSG", which demonstrates the T2I evaluation pipeline with DSG. A hedged sketch of this evaluation loop is given after the table. |
| Open Source Code | No | The paper states: "Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions." While it open-sources the benchmark data, it does not provide a direct link to the source code for the DSG methodology itself. The provided URL `https://google.github.io/dsg` is a project page, not a code repository. Furthermore, the reproducibility statement mentions that some of the LLM and VQA models used (PaLM 2 and PaLI) are not yet released. |
| Open Datasets | Yes | To further facilitate research in T2I alignment evaluation, we collect DSG-1k, a fine-grained human-annotated benchmark with a diverse set of 1,060 prompts (Table 1) with a balanced distribution of semantic categories and styles (Fig. 4). The prompts are sourced from a wide range of existing public datasets. [...] Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions. |
| Dataset Splits | No | The paper refers to "validation" in the context of the VQA validation step and human judgment, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) for its own method or data used for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, memory amounts, or detailed computer specifications used for running its experiments. It mentions using LLMs and VQA models, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions several LLM and VQA models (e.g., PaLM 2, GPT-3.5/4, PaLI, mPLUG-large, InstructBLIP) by name and sometimes by model size, but it does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | No | The paper describes a three-step pipeline for automatic DSG generation and mentions preamble engineering in Appendix A, but it does not provide specific experimental setup details such as concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or explicit training configurations for the models or the DSG pipeline in the main text. A rough sketch of the described three-step generation pipeline is given after the table. |
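
For reference, below is a minimal, hypothetical sketch of the dependency-aware evaluation loop that the paper's Algorithm 1 describes (prompt decomposed into questions, image generation, VQA answering, dependency-filtered scoring). The function names `llm_generate_questions`, `t2i_generate`, and `vqa_answer` are placeholder interfaces rather than the authors' released code, and the exact scoring rule is an assumption based on the paper's description.

```python
# Minimal, hypothetical sketch of the DSG evaluation loop (not the authors' code).
# Assumed interfaces:
#   llm_generate_questions(prompt) -> ({question_id: question_text}, {question_id: [parent_ids]})
#   t2i_generate(prompt)           -> image
#   vqa_answer(image, question)    -> "yes" or "no"

def dsg_score(prompt, llm_generate_questions, t2i_generate, vqa_answer):
    """Score one prompt with dependency-aware question answering."""
    # 1. An LLM decomposes the prompt into atomic yes/no questions plus
    #    a dependency graph over those questions.
    questions, parents = llm_generate_questions(prompt)

    # 2. The T2I model under evaluation generates an image for the prompt.
    image = t2i_generate(prompt)

    # 3. A VQA model answers every question about the generated image.
    answers = {qid: vqa_answer(image, text) for qid, text in questions.items()}

    # 4. A question only counts as satisfied if it and all of its ancestor
    #    (parent) questions are answered "yes"; the final score is the
    #    fraction of satisfied questions.
    def satisfied(qid):
        return answers[qid] == "yes" and all(satisfied(p) for p in parents.get(qid, []))

    return sum(satisfied(qid) for qid in questions) / max(len(questions), 1)
```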
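
Similarly, the three-step DSG generation pipeline noted in the Experiment Setup row can be read as three LLM calls, each conditioned on a hand-written preamble of in-context examples. The sketch below captures only that assumed structure; `call_llm` and the preamble constants are placeholders, and the actual preambles are the ones engineered in Appendix A of the paper.

```python
# Hypothetical outline of the three-step DSG generation pipeline
# (semantic tuples -> questions -> dependencies). The preamble strings and
# the call_llm interface are placeholders, not the paper's released prompts.

TUPLE_PREAMBLE = "..."       # in-context examples: prompt -> atomic semantic tuples
QUESTION_PREAMBLE = "..."    # in-context examples: tuples -> natural-language yes/no questions
DEPENDENCY_PREAMBLE = "..."  # in-context examples: tuples -> parent/child dependency links

def generate_dsg(prompt, call_llm):
    # Step 1: decompose the prompt into atomic semantic tuples
    # (entities, attributes, relations, and so on).
    tuples = call_llm(TUPLE_PREAMBLE, prompt)

    # Step 2: rewrite each tuple as a yes/no question suitable for a VQA model.
    questions = call_llm(QUESTION_PREAMBLE, tuples)

    # Step 3: infer dependency edges between the questions, e.g. an attribute
    # question depends on the existence question for its entity.
    dependencies = call_llm(DEPENDENCY_PREAMBLE, tuples)

    return questions, dependencies
```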