The Generative AI Paradox: “What It Can Create, It May Not Understand”
Authors: Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, Yejin Choi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test this hypothesis through controlled experiments analyzing generation and understanding capabilities in generative models, across language and visual modalities. Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, showing weaker correlation between generation and understanding performance, and more brittleness to adversarial inputs. |
| Researcher Affiliation | Collaboration | ¹University of Washington, ²Allen Institute for Artificial Intelligence; {pawest,linjli}@cs.washington.edu, {ximinglu,nouhad,faezeb}@allenai.org |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states, "All datasets and models we use here are public or can be accessed through public interfaces." However, it does not provide an explicit statement or link to the authors' own source code for the methodology described in the paper. It refers to using existing public models and APIs, but not their own implementation code. |
| Open Datasets | Yes | Language benchmarks. For dialogue, we explore two open-ended datasets, Mutual+ (Cui et al., 2020) and DREAM (Sun et al., 2019), and a document-grounded benchmark, FaithDial (Dziri et al., 2022). ... Vision benchmarks. From COCO (Lin et al., 2014), PaintSkills (Cho et al., 2022), DrawBench (Saharia et al., 2022), and T2I-CompBench (Huang et al., 2023). |
| Dataset Splits | No | The paper specifies test-set sizes (e.g., "500 test examples", "100 text prompts per dataset") but does not provide explicit training/validation/test splits, as percentages or counts, nor a reference to predefined splits for all datasets used. |
| Hardware Specification | No | The paper mentions using the OpenAI API for GPT models and models available on Hugging Face, but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the OpenAI API and models from Hugging Face (Wolf et al., 2019) but does not specify version numbers for its software dependencies; the only versioning information is the year in the Hugging Face citation. |
| Experiment Setup | Yes | During inference, we set nucleus sampling p to 1 and temperature to 1. For each task, we evaluate the performance of each model on 500 test examples. |
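
For context, the sketch below shows how the quoted decoding settings (nucleus sampling p = 1, temperature = 1) could be applied with the Hugging Face transformers `generate` API. This is not the authors' code; the model name, prompt, and `max_new_tokens` value are illustrative assumptions.

```python
# Minimal sketch of the reported inference settings: top_p = 1.0, temperature = 1.0.
# Model name and prompt are placeholders, not details taken from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates publicly available models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Example dialogue context goes here."  # placeholder test example
inputs = tokenizer(prompt, return_tensors="pt")

# Decoding parameters quoted from the paper's experiment setup.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=1.0,
    temperature=1.0,
    max_new_tokens=128,  # assumed length cap; not specified in the quoted setup
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```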