RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

Authors: Haoyu Chen, Wenbo Li, JINJIN GU, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate the superior performance of RestoreAgent in handling complex degradation, surpassing human experts. Furthermore, the system's modular design facilitates the fast integration of new tasks and models.
Researcher Affiliation Collaboration Haoyu Chen1, Wenbo Li2, Jinjin Gu3, Jingjing Ren1, Sixiang Chen1, Tian Ye1, Renjing Pei2, Kaiwen Zhou2, Fenglong Song2, Lei Zhu1,4 1The Hong Kong University of Science and Technology (Guangzhou) 2Huawei Noah's Ark Lab 3The University of Sydney 4The Hong Kong University of Science and Technology
Pseudocode No The paper describes the pipeline and processes in text and flowcharts (e.g., Figure 3) but does not include any formal pseudocode blocks or algorithms.
Open Source Code Yes Project page: https://haoyuchen.com/RestoreAgent
Open Datasets Yes To fully leverage the potential of multimodal large models, we construct a substantial dataset of paired training samples. The process begins with applying various types of degradation to an image. Subsequently, we determine the optimal restoration pipeline using model tools for processing. For each image undergoing multiple degradations, a comprehensive search is conducted to identify the best restoration pipeline, as shown in Figure 3. This involves generating all possible permutations of task execution sequences and model combinations, applying each pipeline to the degraded image, and assessing the quality of the restored outputs using a scoring function S(I, σ). This part of the data exceeds 23k pairs.
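The exhaustive search described above (all permutations of task execution sequences crossed with all model combinations, each candidate pipeline scored on its restored output) can be sketched as follows. This is a minimal illustration, not the authors' code: the task names, model callables, and `score` function stand in for the paper's restoration models and scoring function S(I, σ).

```python
from itertools import permutations, product

def best_pipeline(image, tasks, models_for_task, score):
    """Exhaustive search over restoration pipelines, as in the paper's
    data-construction step. All names here are illustrative placeholders."""
    best_pipeline_found, best_score = None, float("-inf")
    for order in permutations(tasks):                # every task execution sequence
        choices = [models_for_task[t] for t in order]
        for combo in product(*choices):              # every model combination
            restored = image
            for model in combo:                      # apply models in sequence
                restored = model(restored)
            s = score(restored)                      # stand-in for S(I, sigma)
            if s > best_score:
                best_pipeline_found, best_score = (order, combo), s
    return best_pipeline_found, best_score

# Toy usage: "images" are numbers, "models" are arithmetic ops.
tasks = ["denoise", "deblur"]
models_for_task = {
    "denoise": [lambda x: x + 1, lambda x: x + 2],
    "deblur":  [lambda x: x * 2],
}
pipeline, quality = best_pipeline(1, tasks, models_for_task, score=lambda x: x)
```

Note the combinatorial cost: with T tasks and M models per task, the search evaluates T! · M^T pipelines per image, which is why the paper precomputes optimal pipelines offline to build its 23k-pair training set rather than searching at inference time.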
Dataset Splits No The paper describes the 'training datasets' and 'testing datasets' but does not explicitly mention or specify a distinct 'validation' dataset or its split. For example, in Section 4.1, it states: 'For the testing datasets, we assemble 200 images, mirroring the degradation types found in the training datasets, to facilitate evaluation.'
Hardware Specification Yes The Restore Agent undergoes training across ten epochs on 4 NVIDIA RTX A100 GPUs, with a batch size of 32. We employ the Adam optimizer and a learning rate of 0.00002. The total duration of the training process approximates ten hours.
Software Dependencies Yes In this study, we incorporate the CLIP pre-trained Vision Transformer (ViT-L/14) [42] as the image encoder to convert input images into visual tokens. For the language model, we utilize the Llama3 7B [46]. Despite their capabilities, pre-trained LLMs fail to provide accurate responses without dataset-specific fine-tuning. To address this, we adopt LoRA [21], a fine-tuning technique that efficiently modifies a limited number of parameters within the model. Following [21], we apply LoRA to adjust the projection layers in all self-attention modules of both the vision encoder and the LLM, thereby generating our RestoreAgent. We employ the Xtuner framework [15] to facilitate the training process.
Experiment Setup Yes For our experimental setup, we configure the LoRA rank to 16. The RestoreAgent undergoes training across ten epochs on 4 NVIDIA RTX A100 GPUs, with a batch size of 32. We employ the Adam optimizer and a learning rate of 0.00002. The total duration of the training process approximates ten hours.
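The reported hyperparameters can be collected into a single config object for reproduction attempts. This is a sketch under the setup the paper states (LoRA rank 16, 10 epochs, 4 GPUs, batch size 32, Adam at 2e-5); the field names are illustrative, not taken from the paper's or XTuner's actual config schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as reported in the paper; field names are our own."""
    lora_rank: int = 16          # LoRA rank on attention projection layers
    epochs: int = 10             # "training across ten epochs"
    num_gpus: int = 4            # 4 NVIDIA A100 GPUs
    batch_size: int = 32
    optimizer: str = "adam"
    learning_rate: float = 2e-5  # i.e. 0.00002

cfg = FinetuneConfig()
```

In practice these values would be translated into an XTuner LLaVA-style config file; the scalar values above are the only ones the paper specifies.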