Bifröst: 3D-Aware Image Compositing with Language Instructions

Authors: Lingxiao Li, Kaixiong Gong, Wei-Hong Li, Xili Dai, Tao Chen, Xiaojun Yuan, Xiangyu Yue

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding.
Researcher Affiliation | Academia | 1 MMLab, The Chinese University of Hong Kong; 2 The Hong Kong University of Science and Technology (Guangzhou); 3 Fudan University; 4 University of Electronic Science and Technology of China
Pseudocode | No | The paper includes pipeline diagrams and descriptive text for its methods but does not provide formal pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/lingxiao-li/Bifrost
Open Datasets | Yes | To this end, we build a novel counterfactual dataset based on MS COCO dataset (Lin et al., 2014). [...] The full data used is listed in Tab. 1, which covers a large variety of domains such as nature scenes (SAM (Kirillov et al., 2023), LVIS (Gupta et al., 2019), HFlickr (Cong et al., 2020), DUT (Wang et al., 2017), and MSRA-10K (Borji et al., 2015)), panoptic video segmentation datasets (YouTube-VOS (Xu et al., 2018), VIPSeg (Miao et al., 2022), and MOSE (Ding et al., 2023)), and virtual try-on dataset (VITON-HD (Choi et al., 2021)).
Dataset Splits | No | The paper states 'We collect 30,080 image-instruction-answer pairs for training and 855 for testing' for the MLLM and describes the training data and evaluation test sets for the image compositing model, but it does not report a separate validation split used during training, nor explicit train/validation/test percentages or counts for the compositing model beyond the test sets used for evaluation.
Hardware Specification | Yes | we train 15 epochs with 4 A800 GPUs, which takes about 5 hours to finish the fine-tuning. (...) we train 20k steps on 4 A800 GPUs, which takes 1 day to finish the training. (...) Our model can run on a single NVIDIA RTX 4090 GPU (24GB) thanks to our two-stage training/inference since one does not need to load all models simultaneously.
Software Dependencies | Yes | We choose LLaVA (Liu et al., 2023a, 2024) as our method to fine-tune multimodal large language models and Vicuna (Chiang et al., 2023) as the LLM. We choose Stable Diffusion V2.1 (Rombach et al., 2022) as the base generator for the image compositing model. We use pre-trained CLIP-ViT-large (Radford et al., 2021) as the visual encoder. We leverage pre-trained DINO-V2 (Oquab et al., 2023) to encode images into visual tokens. We use a pre-trained depth predictor DPT (Ranftl et al., 2021).
Experiment Setup | Yes | Hyperparameters. We choose LLaVA (Liu et al., 2023a, 2024) as our method to fine-tune multimodal large language models and Vicuna (Chiang et al., 2023) as the LLM. The learning rate is set to 2e-5 and we train for 15 epochs. We choose Stable Diffusion V2.1 (Rombach et al., 2022) as the base generator for the image compositing model. During training, we set the image resolution to 512×512. We choose the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 1e-5. The batch size is set to 16, and we train for 20k steps on 4 A800 GPUs.
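
The reported setup amounts to two training stages with a small set of hyperparameters. The following is a minimal configuration sketch, assuming a Python codebase; it is not the authors' released code, the class and field names are hypothetical, and only the numeric values and model names are taken from the paper.

```python
# Hypothetical configuration sketch of the two training stages reported in the
# paper (not the authors' code); only the values and model names come from the paper.
from dataclasses import dataclass


@dataclass
class MLLMFinetuneConfig:
    """Stage 1: fine-tune the multimodal LLM (LLaVA with Vicuna as the LLM)."""
    base_model: str = "LLaVA + Vicuna"
    learning_rate: float = 2e-5
    epochs: int = 15
    num_gpus: int = 4            # NVIDIA A800, ~5 hours reported
    train_pairs: int = 30_080    # image-instruction-answer pairs
    test_pairs: int = 855


@dataclass
class CompositingTrainConfig:
    """Stage 2: train the image compositing model on top of Stable Diffusion V2.1."""
    base_generator: str = "Stable Diffusion V2.1"
    visual_encoder: str = "CLIP-ViT-large"
    image_tokenizer: str = "DINO-V2"
    depth_predictor: str = "DPT"
    image_resolution: int = 512  # images are resized to 512x512 during training
    optimizer: str = "Adam"
    learning_rate: float = 1e-5
    batch_size: int = 16
    train_steps: int = 20_000
    num_gpus: int = 4            # NVIDIA A800, ~1 day reported


if __name__ == "__main__":
    print(MLLMFinetuneConfig())
    print(CompositingTrainConfig())
```

Keeping the two stages as separate configurations mirrors the paper's two-stage training/inference, which is also why inference reportedly fits on a single 24GB RTX 4090: the MLLM and the diffusion compositor never need to be loaded simultaneously.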