Unifying Image Processing as Visual Prompting Question Answering
Authors: Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments and Analysis. Table 1: Quantitative results (PSNR/SSIM) on image restoration tasks. Table 2: Quantitative results on image enhancement and image edge detection. Illustrated in Figs. 5 and 6, PromptGIP proficiently addresses a range of image processing tasks... |
| Researcher Affiliation | Collaboration | ¹Shanghai Artificial Intelligence Laboratory, ²Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, ³University of Macau, ⁴Kuaishou Technology. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes will be available at https://github.com/lyh-18/PromptGIP. |
| Open Datasets | Yes | For the first eight types, we directly introduce corresponding distortions to the ImageNet (Deng et al., 2009) dataset to create degraded-clean pairs (see the degradation-synthesis sketch after this table). For dehazing, we utilize the ITS training set of the RESIDE dataset (Li et al., 2018). For rain removal, we employ two types of rain addition models: Simple Rain Model and Complex Rain Model. The former is a simple additive rain model synthesized on the ImageNet dataset, while the latter utilizes Rain13K (Zamir et al., 2021). For LLE, the LOL dataset (Wei et al., 2018) is adopted for training. For LLF, we apply a local Laplacian filter (Aubry et al., 2014) to the expert-C retouched images of the MIT-Adobe FiveK dataset (Bychkovsky et al., 2011), forming the requisite input-output pairs. The ImageNet dataset forms the basis for creating input-output training pairs. |
| Dataset Splits | No | The paper mentions collecting datasets for testing and using some datasets for both training and testing (e.g., LOL, MIT-Adobe FiveK), but it does not specify explicit training/validation/test dataset splits (e.g., 80/10/10 percentages or specific sample counts for each split) for its primary training datasets. |
| Hardware Specification | Yes | We use 8 Tesla V100 GPUs for training. |
| Software Dependencies | No | The paper mentions using a "vanilla vision Transformer (ViT-large)" as the backbone and "AdamW" as the optimizer, but it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used (e.g., Python 3.x). |
| Experiment Setup | Yes | During training, the model processes sequences of four 256×256 images in a Q-A-Q-A pattern, resulting in a 4×256×256 total input resolution (see the training-step sketch after this table). L1 loss is utilized as the loss function. For optimization, the AdamW (Loshchilov & Hutter, 2017) optimizer with a cosine learning rate scheduler is employed. The base learning rate is 1e-4. The batch size is 48. We use 8 Tesla V100 GPUs for training. A total of 50 epochs are executed. For testing Painter and PromptGIP, we construct 20 image prompts for each task and report the best results. |
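
The degraded-clean pair construction described in the Open Datasets row is straightforward to reproduce in outline. Below is a minimal sketch, assuming RGB images as NumPy arrays in [0, 1]; the distortion parameters and the vertical-streak rain model are illustrative assumptions, not the paper's exact synthesis settings.

```python
# Minimal sketch of degraded-clean pair synthesis on ImageNet-style images.
# Parameters (sigma, density, streak length) are illustrative assumptions.
import numpy as np

def add_gaussian_noise(clean: np.ndarray, sigma: float = 25 / 255) -> np.ndarray:
    """Example of one synthetic distortion: additive Gaussian noise."""
    noisy = clean + np.random.randn(*clean.shape) * sigma
    return np.clip(noisy, 0.0, 1.0)

def add_simple_rain(clean: np.ndarray, density: float = 0.02, length: int = 12) -> np.ndarray:
    """Simple additive rain model: sparse bright vertical streaks blended in."""
    h, w = clean.shape[:2]
    streaks = np.zeros((h, w), dtype=np.float32)
    for _ in range(int(density * h * w / length)):
        y = np.random.randint(0, h - length)
        x = np.random.randint(0, w)
        streaks[y : y + length, x] = np.random.uniform(0.6, 1.0)
    rainy = clean + streaks[..., None]  # broadcast streaks over RGB channels
    return np.clip(rainy, 0.0, 1.0)

def make_pair(clean: np.ndarray, distortion) -> tuple[np.ndarray, np.ndarray]:
    """Return a (degraded, clean) training pair for a restoration task."""
    return distortion(clean), clean
```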
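The Experiment Setup row pins down most training hyperparameters. A minimal PyTorch sketch of the Q-A-Q-A training step is given below, assuming a small stand-in convolutional model in place of the ViT-large backbone and assuming supervision on the second answer slot only; the paper's exact masked-prediction scheme may differ.

```python
# Minimal sketch of one Q-A-Q-A training step. The stand-in model and the
# supervision region are assumptions; the paper uses a ViT-large backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(  # stand-in for the ViT-large backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.GELU(), nn.Conv2d(64, 3, 3, padding=1)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # base LR 1e-4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

def training_step(q1, a1, q2, a2):
    """q1, a1, q2, a2: tensors of shape (B, 3, 256, 256)."""
    # Concatenate Q-A-Q-A vertically: 4x256x256 total input resolution.
    seq = torch.cat([q1, a1, q2, a2], dim=2)  # (B, 3, 1024, 256)
    pred = model(seq)
    # L1 loss on the second answer slot (rows 768..1023), as an assumed
    # simplification of the paper's masked-answer supervision.
    loss = F.l1_loss(pred[:, :, 768:1024, :], a2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The reported schedule runs 50 epochs at batch size 48 on 8 Tesla V100 GPUs, with the cosine scheduler (`scheduler.step()` once per epoch) decaying the learning rate from the base value of 1e-4.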