Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
Authors: Jaemin Cho, Yushi Hu, Jason Michael Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, Jordi Pont-Tuset, Su Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. |
| Researcher Affiliation | Collaboration | Jaemin Cho¹, Yushi Hu², Roopal Garg³, Peter Anderson³, Ranjay Krishna², Jason Baldridge³, Mohit Bansal¹, Jordi Pont-Tuset³, Su Wang³; ¹University of North Carolina at Chapel Hill, ²University of Washington, ³Google Research |
| Pseudocode | Yes | The paper includes Algorithm 1, captioned "Python pseudocode of T2I evaluation with DSG", which demonstrates the T2I evaluation pipeline with DSG. A hedged sketch of this evaluation loop is given after the table. |
| Open Source Code | No | The paper states: "Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions." While it open-sources the benchmark data, it does not provide a direct link to the source code for the DSG methodology itself. The provided URL `https://google.github.io/dsg` is a project page, not a code repository. Furthermore, the reproducibility statement mentions that some of the LLM and VQA models used (PaLM 2 and PaLI) are not yet released. |
| Open Datasets | Yes | To further facilitate research in T2I alignment evaluation, we collect DSG-1k, a fine-grained human-annotated benchmark with a diverse set of 1,060 prompts (Table 1) with a balanced distribution of semantic categories and styles (Fig. 4). The prompts are sourced from a wide range of existing public datasets. [...] Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions. |
| Dataset Splits | No | The paper refers to "validation" in the context of the VQA validation step and human judgment, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) for its own method or data used for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, memory amounts, or detailed computer specifications used for running its experiments. It mentions using LLMs and VQA models, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions several LLM and VQA models (e.g., PaLM 2, GPT-3.5/4, PaLI, mPLUG-large, InstructBLIP) by name and sometimes by model size, but it does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | No | The paper describes a three-step pipeline for automatic DSG generation and mentions preamble engineering in Appendix A, but it does not provide specific experimental setup details such as concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or explicit training configurations for the models or the DSG pipeline in the main text. A rough sketch of the described three-step generation pipeline is given after the table. |
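
For reference, below is a minimal, hypothetical sketch of the dependency-aware evaluation loop that the paper's Algorithm 1 describes (prompt decomposed into questions, image generation, VQA answering, dependency-filtered scoring). The function names `llm_generate_questions`, `t2i_generate`, and `vqa_answer` are placeholder interfaces rather than the authors' released code, and the exact scoring rule is an assumption based on the paper's description.

```python
# Minimal, hypothetical sketch of the DSG evaluation loop (not the authors' code).
# Assumed interfaces:
#   llm_generate_questions(prompt) -> ({question_id: question_text}, {question_id: [parent_ids]})
#   t2i_generate(prompt)           -> image
#   vqa_answer(image, question)    -> "yes" or "no"

def dsg_score(prompt, llm_generate_questions, t2i_generate, vqa_answer):
    """Score one prompt with dependency-aware question answering."""
    # 1. An LLM decomposes the prompt into atomic yes/no questions plus
    #    a dependency graph over those questions.
    questions, parents = llm_generate_questions(prompt)

    # 2. The T2I model under evaluation generates an image for the prompt.
    image = t2i_generate(prompt)

    # 3. A VQA model answers every question about the generated image.
    answers = {qid: vqa_answer(image, text) for qid, text in questions.items()}

    # 4. A question only counts as satisfied if it and all of its ancestor
    #    (parent) questions are answered "yes"; the final score is the
    #    fraction of satisfied questions.
    def satisfied(qid):
        return answers[qid] == "yes" and all(satisfied(p) for p in parents.get(qid, []))

    return sum(satisfied(qid) for qid in questions) / max(len(questions), 1)
```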
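
Similarly, the three-step DSG generation pipeline noted in the Experiment Setup row can be read as three LLM calls, each conditioned on a hand-written preamble of in-context examples. The sketch below captures only that assumed structure; `call_llm` and the preamble constants are placeholders, and the actual preambles are the ones engineered in Appendix A of the paper.

```python
# Hypothetical outline of the three-step DSG generation pipeline
# (semantic tuples -> questions -> dependencies). The preamble strings and
# the call_llm interface are placeholders, not the paper's released prompts.

TUPLE_PREAMBLE = "..."       # in-context examples: prompt -> atomic semantic tuples
QUESTION_PREAMBLE = "..."    # in-context examples: tuples -> natural-language yes/no questions
DEPENDENCY_PREAMBLE = "..."  # in-context examples: tuples -> parent/child dependency links

def generate_dsg(prompt, call_llm):
    # Step 1: decompose the prompt into atomic semantic tuples
    # (entities, attributes, relations, and so on).
    tuples = call_llm(TUPLE_PREAMBLE, prompt)

    # Step 2: rewrite each tuple as a yes/no question suitable for a VQA model.
    questions = call_llm(QUESTION_PREAMBLE, tuples)

    # Step 3: infer dependency edges between the questions, e.g. an attribute
    # question depends on the existence question for its entity.
    dependencies = call_llm(DEPENDENCY_PREAMBLE, tuples)

    return questions, dependencies
```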