InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Authors: Yulu Gan, Sungwoo Park, Alexander Marcel Schubert, Anthony Philippakis, Ahmed Alaa

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions. Code: https://github.com/AlaaLab/InstructCV Demo: https://huggingface.co/spaces/alaa-lab/InstructCV
Researcher Affiliation | Collaboration | Yulu Gan (Peking University); Sungwoo Park (UC Berkeley); Alexander Schubert (UC Berkeley and UCSF); Anthony Philippakis (Broad Institute of MIT & Harvard); Ahmed M. Alaa (UC Berkeley and UCSF)
Pseudocode | No | No pseudocode or algorithm block found.
Open Source Code | Yes | Code: https://github.com/AlaaLab/InstructCV
Open Datasets | Yes | We combine four widely-used computer vision datasets (MS-COCO [28], ADE20K [29, 30], Oxford-IIIT Pets [31] and NYUv2 [32]) covering four vision tasks (semantic segmentation, object detection, monocular depth estimation and classification) into a single multi-task dataset D = {(x_i, y_i, m_i)}_i. (A hypothetical construction sketch appears after the table.)
Dataset Splits | Yes | ADE20K covers 150 semantic categories and comprises 25,000 images, of which we use 20,000 for training, 2,000 for validation, and 3,000 for testing. We follow the same protocol as suggested in [18] to implement the training/test split.
Hardware Specification | Yes | We train InstructCV for 20 epochs on 8 NVIDIA A100 GPUs over 10 hours.
Software Dependencies | Yes | The proposed model is initialized with EMA weights obtained from the Stable Diffusion checkpoint, and trained with a learning rate of 10^-4 without any warm-up stage.
Experiment Setup | Yes | We train InstructCV for 20 epochs on 8 NVIDIA A100 GPUs over 10 hours. The training involves images at a resolution of 256×256 and incorporates data augmentation, including random horizontal flipping and cropping, with a batch size of 128. The proposed model is initialized with EMA weights obtained from the Stable Diffusion checkpoint, and trained with a learning rate of 10^-4 without any warm-up stage. (A hedged configuration sketch appears after the table.)
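
For concreteness, here is a minimal sketch of how the multi-task dataset D = {(x_i, y_i, m_i)}_i described in the Open Datasets row could be assembled: each sample pairs an input image x_i with a target image y_i (the task output rendered as an image) and a natural-language instruction m_i. The class name, prompt templates, and dataset interface below are illustrative assumptions, not taken from the InstructCV codebase.

```python
# Hypothetical sketch only: prompt templates and the dataset interface are
# assumptions; InstructCV's actual pipeline lives in the linked repository.
from torch.utils.data import Dataset

class MultiTaskVisionDataset(Dataset):
    """Flattens several (image, target, category) datasets into one pool
    of (input x_i, target y_i, instruction m_i) training triples."""

    # One illustrative prompt template per task; the paper uses a richer
    # set of paraphrased user instructions.
    PROMPTS = {
        "segmentation": "Segment the {category}.",
        "detection": "Detect the {category}.",
        "depth": "Estimate the depth map of this image.",
        "classification": "Is there a {category} in this image?",
    }

    def __init__(self, task_datasets):
        # task_datasets: list of (task_name, dataset) pairs; each dataset
        # is assumed to yield (input_image, target_image, category) triples.
        self.index = [
            (task, ds, i)
            for task, ds in task_datasets
            for i in range(len(ds))
        ]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, k):
        task, ds, i = self.index[k]
        x, y, category = ds[i]
        m = self.PROMPTS[task].format(category=category)
        return {"input": x, "target": y, "instruction": m}
```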
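
Likewise, a hedged sketch of the training recipe reported in the Experiment Setup row, assuming a torchvision-style augmentation pipeline: only the numerical values come from the quoted text, while the key names and the exact crop operation are illustrative.

```python
# Sketch under assumptions: the numbers (20 epochs, 256x256, batch 128,
# lr 1e-4, no warm-up, 8x A100) come from the quoted setup; everything
# else (key names, exact crop op) is illustrative.
from torchvision import transforms

# Random horizontal flipping and cropping at 256x256 resolution, as reported.
train_transform = transforms.Compose([
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

TRAIN_CONFIG = {
    "init": "EMA weights from the Stable Diffusion checkpoint",
    "epochs": 20,
    "batch_size": 128,
    "learning_rate": 1e-4,  # constant; no warm-up stage
    "resolution": 256,
    "hardware": "8x NVIDIA A100 (~10 hours total)",
}
```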