Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

Authors: Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. Then, we test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics.
Researcher Affiliation | Collaboration | Yuval Kirstain (τ), Adam Polyak (τ), Uriel Singer, Shahbuland Matiana (σ), Joe Penna (σ), Omer Levy (τ); τ Tel Aviv University, σ Stability AI
Pseudocode | No | The paper describes the model and objective using mathematical formulas but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The model is available at https://huggingface.co/yuvalkirstain/PickScore_v1
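A minimal sketch of scoring images with the released checkpoint, assuming it loads as a standard CLIP model via transformers; the CLIP-H processor path and the `pickscore` helper below are assumptions following common usage of the checkpoint, not quotes from the paper:

```python
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: the checkpoint is CLIP-H-based, so we pair it with a CLIP-H processor.
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

@torch.no_grad()
def pickscore(prompt, images):
    """Score a list of PIL images against one prompt (higher = preferred)."""
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt").to(device)
    image_embs = model.get_image_features(**image_inputs)
    text_embs = model.get_text_features(**text_inputs)
    # Cosine similarity scaled by the learned temperature, as in CLIP.
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    return (model.logit_scale.exp() * text_embs @ image_embs.T)[0]
```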
Open Datasets | Yes | The Pick-a-Pic dataset was created by logging user interactions with the Pick-a-Pic web application for text-to-image generation. Overall, the Pick-a-Pic dataset contains over 500,000 examples and 35,000 distinct prompts. [...] The dataset is available at https://huggingface.co/datasets/yuvalkirstain/pickapic_v1, and its updated version is available at https://huggingface.co/datasets/yuvalkirstain/pickapic_v2.
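A short sketch of pulling the dataset with the Hugging Face datasets library; since the paper excerpt does not list field names, the example only inspects the schema:

```python
from datasets import load_dataset

# Streaming avoids downloading the full >500,000-example dataset up front.
ds = load_dataset("yuvalkirstain/pickapic_v1", split="train", streaming=True)
print(next(iter(ds)).keys())  # inspect the available fields
```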
Dataset Splits | Yes | To divide the dataset into training, validation, and testing subsets, we first sample one thousand prompts, ensuring that each prompt was created by a unique user. Next, we randomly divide those prompts into two sets of equal size to create the validation and test sets. We then sample exactly one example for each prompt to include in these sets. For the training set, we include all examples that do not share a prompt with the validation and test sets. This approach ensures that no split shares prompts with another split, and the validation and test sets do not suffer from being non-proportionally fitted to a specific prompt or user. [...] that contains 583,747 training examples, and 500 validation and test examples.
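The quoted split procedure can be sketched as follows; the record layout (hypothetical `prompt_id`, `user_id` fields on dict records) is an illustrative assumption, not the dataset's actual schema:

```python
import random

def make_splits(examples, n_heldout_prompts=1000, seed=0):
    """Recreate the paper's split logic on a list of dict records."""
    rng = random.Random(seed)
    # Group examples by prompt.
    by_prompt = {}
    for ex in examples:
        by_prompt.setdefault(ex["prompt_id"], []).append(ex)
    # Sample 1,000 held-out prompts, each created by a distinct user.
    seen_users, heldout = set(), []
    for pid in rng.sample(list(by_prompt), len(by_prompt)):
        user = by_prompt[pid][0]["user_id"]
        if user not in seen_users:
            seen_users.add(user)
            heldout.append(pid)
        if len(heldout) == n_heldout_prompts:
            break
    # Split held-out prompts evenly into validation and test,
    # keeping exactly one example per prompt.
    rng.shuffle(heldout)
    half = len(heldout) // 2
    val = [rng.choice(by_prompt[pid]) for pid in heldout[:half]]
    test = [rng.choice(by_prompt[pid]) for pid in heldout[half:]]
    # Training keeps every example whose prompt is not held out.
    heldout_set = set(heldout)
    train = [ex for ex in examples if ex["prompt_id"] not in heldout_set]
    return train, val, test
```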
Hardware Specification | Yes | The experiment is completed in less than an hour with 8 A100 GPUs.
Software Dependencies | No | The paper mentions the use of 'CLIP-H' and 'InstructGPT's reward model objective', which are model architectures or objectives. However, it does not provide specific version numbers for software libraries, frameworks, or programming languages used (e.g., PyTorch version, Python version).
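For illustration, the InstructGPT-style reward-model objective referenced here, which the paper describes as minimizing the KL divergence between the human preference labels over an image pair (ties encoded as 0.5/0.5) and the model's softmax over the two scores, can be sketched as below; variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_0, score_1, label_0, label_1):
    """KL between human preference labels and the model's softmax over scores.

    score_*: PickScore of each image in the pair, shape (batch,).
    label_*: human preference (1/0 for a win, 0.5/0.5 for a tie).
    """
    logits = torch.stack([score_0, score_1], dim=-1)  # (batch, 2)
    labels = torch.stack([label_0, label_1], dim=-1)  # (batch, 2)
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, labels, reduction="batchmean")
```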
Experiment Setup | Yes | We train the model for 4,000 steps, with a learning rate of 3e-6, a total batch size of 128, and a warmup period of 500 steps, after which the learning rate decays linearly.
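A minimal sketch of this optimization setup using the transformers scheduler helper; the AdamW choice and the `loader`/batch layout are assumptions, as the quote specifies only the step count, learning rate, batch size, and warmup:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` as loaded in the scoring sketch above; `loader` is a hypothetical
# DataLoader yielding batches of 128 preference pairs.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,      # linear warmup
    num_training_steps=4_000,  # then linear decay to zero
)

for step, batch in zip(range(4_000), loader):
    loss = preference_loss(*batch)  # hypothetical batch unpacking
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```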