InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Authors: Yulu Gan, Sungwoo Park, Alexander Marcel Schubert, Anthony Philippakis, Ahmed Alaa

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions. Code: https://github.com/AlaaLab/InstructCV Demo: https://huggingface.co/spaces/alaa-lab/InstructCV
Researcher Affiliation | Collaboration | Yulu Gan (Peking University); Sungwoo Park (UC Berkeley); Alexander Schubert (UC Berkeley and UCSF); Anthony Philippakis (Broad Institute of MIT & Harvard); Ahmed M. Alaa (UC Berkeley and UCSF)
Pseudocode | No | No pseudocode or algorithm block found.
Open Source Code | Yes | Code: https://github.com/AlaaLab/InstructCV
Open Datasets | Yes | We combine four widely-used computer vision datasets (MS-COCO [28], ADE20K [29, 30], Oxford-IIIT Pets [31] and NYUv2 [32]) covering four vision tasks (semantic segmentation, object detection, monocular depth estimation and classification) into a single multi-task dataset D = {(x_i, y_i, m_i)}_i. (A hypothetical construction sketch appears after the table.)
Dataset Splits | Yes | ADE20K covers 150 semantic categories and comprises 25,000 images, of which we use 20,000 for training, 2,000 for validation, and 3,000 for testing. We follow the same protocol as suggested in [18] to implement the training/test split.
Hardware Specification | Yes | We train InstructCV for 20 epochs on 8 NVIDIA A100 GPUs over 10 hours.
Software Dependencies | Yes | The proposed model is initialized with EMA weights obtained from the Stable Diffusion checkpoint, and trained with a learning rate of 10^-4 without any warm-up stage.
Experiment Setup | Yes | We train InstructCV for 20 epochs on 8 NVIDIA A100 GPUs over 10 hours. The training involves images at a resolution of 256×256 and incorporates data augmentation, including random horizontal flipping and cropping, with a batch size of 128. The proposed model is initialized with EMA weights obtained from the Stable Diffusion checkpoint, and trained with a learning rate of 10^-4 without any warm-up stage. (A hedged configuration sketch appears after the table.)
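
For concreteness, here is a minimal sketch of how the multi-task dataset D = {(x_i, y_i, m_i)}_i described in the Open Datasets row could be assembled: each sample pairs an input image x_i with a target image y_i (the task output rendered as an image) and a natural-language instruction m_i. The class name, prompt templates, and dataset interface below are illustrative assumptions, not taken from the InstructCV codebase.

```python
# Hypothetical sketch only: prompt templates and the dataset interface are
# assumptions; InstructCV's actual pipeline lives in the linked repository.
from torch.utils.data import Dataset

class MultiTaskVisionDataset(Dataset):
    """Flattens several (image, target, category) datasets into one pool
    of (input x_i, target y_i, instruction m_i) training triples."""

    # One illustrative prompt template per task; the paper uses a richer
    # set of paraphrased user instructions.
    PROMPTS = {
        "segmentation": "Segment the {category}.",
        "detection": "Detect the {category}.",
        "depth": "Estimate the depth map of this image.",
        "classification": "Is there a {category} in this image?",
    }

    def __init__(self, task_datasets):
        # task_datasets: list of (task_name, dataset) pairs; each dataset
        # is assumed to yield (input_image, target_image, category) triples.
        self.index = [
            (task, ds, i)
            for task, ds in task_datasets
            for i in range(len(ds))
        ]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, k):
        task, ds, i = self.index[k]
        x, y, category = ds[i]
        m = self.PROMPTS[task].format(category=category)
        return {"input": x, "target": y, "instruction": m}
```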
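
Likewise, a hedged sketch of the training recipe reported in the Experiment Setup row, assuming a torchvision-style augmentation pipeline: only the numerical values come from the quoted text, while the key names and the exact crop operation are illustrative.

```python
# Sketch under assumptions: the numbers (20 epochs, 256x256, batch 128,
# lr 1e-4, no warm-up, 8x A100) come from the quoted setup; everything
# else (key names, exact crop op) is illustrative.
from torchvision import transforms

# Random horizontal flipping and cropping at 256x256 resolution, as reported.
train_transform = transforms.Compose([
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

TRAIN_CONFIG = {
    "init": "EMA weights from the Stable Diffusion checkpoint",
    "epochs": 20,
    "batch_size": 128,
    "learning_rate": 1e-4,  # constant; no warm-up stage
    "resolution": 256,
    "hardware": "8x NVIDIA A100 (~10 hours total)",
}
```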