UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis

Authors: Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, Hongxia Yang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on a newly collected large-scale clothing dataset M2C-Fashion and a facial dataset Multi-Modal CelebA-HQ verify that UFC-BERT can synthesize high-fidelity images that comply with flexible multi-modal controls."
Researcher Affiliation | Collaboration | "Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, and Hongxia Yang. DAMO Academy, Alibaba Group; Tsinghua University. {zhangzhu950310}@gmail.com, {jason.mjx, ericzhou.zc, yang.yhx}@alibaba-inc.com"
Pseudocode | No | The paper describes its algorithms, such as Mask-Predict and Progressive Non-Autoregressive Generation (PNAG), in detail, but does not present them as structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "We additionally use another high-resolution facial dataset Multi-Modal CelebA-HQ [28, 61]."
Dataset Splits | No | The paper mentions using two datasets (M2C-Fashion and Multi-Modal CelebA-HQ) but does not specify how they were split into training, validation, and test sets, which limits reproducibility.
Hardware Specification | Yes | "We evaluate speed on the same V100 GPU."
Software Dependencies | No | The paper does not list software dependencies or version numbers (e.g., Python, PyTorch, or TensorFlow versions) used for its implementation or experiments.
Experiment Setup | Yes | "For the BERT model, we set the number of layers, hidden size, and the number of attention heads to 24, 1024, and 16, respectively. Our UFC-BERT has 307M parameters, same as the Transformer used by VQGAN. As for hyper-parameters of PNAG, we set the parallel decoding number B to 5 and the balance coefficient σ to 0.5. We set the initial mask ratio α, the minimum mask ratio β, and the maximum iteration number T to 0.8, 0.2, and 10, respectively."
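Since the paper provides no pseudocode, the Mask-Predict-style iterative refinement that PNAG builds on can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, toy model, and vocabulary size are all made up, and the mask-ratio annealing follows the reported α = 0.8, β = 0.2, T = 10 schedule.

```python
import numpy as np

def mask_predict(model, length, mask_id, T=10, alpha=0.8, beta=0.2):
    """Sketch of Mask-Predict-style iterative decoding.

    `model` maps a (length,) token array to (length, vocab) logits.
    Illustrative only; not the UFC-BERT authors' code.
    """
    tokens = np.full(length, mask_id, dtype=np.int64)  # start fully masked
    for t in range(T):
        logits = model(tokens)
        # Softmax to get per-position confidences.
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = e / e.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        tokens = pred.copy()
        if t == T - 1:
            break
        # Linearly anneal the re-mask ratio from alpha down to beta.
        ratio = alpha + (beta - alpha) * (t + 1) / (T - 1)
        n_mask = int(length * ratio)
        if n_mask == 0:
            break
        lowest = np.argsort(conf)[:n_mask]  # least confident positions
        tokens[lowest] = mask_id            # re-mask them for the next pass
    return tokens

# Toy "model": fixed random logits per input token; the mask token is disallowed.
rng = np.random.default_rng(0)
VOCAB, MASK = 64, 0
table = rng.standard_normal((VOCAB, VOCAB))
table[:, MASK] = -1e9

def toy_model(tokens):
    return table[tokens]

out = mask_predict(toy_model, length=16, mask_id=MASK)
print(out.shape, int((out == MASK).sum()))
```

Note that this shows only the single-sequence refinement loop; per the paper, PNAG additionally decodes B = 5 candidate sequences in parallel and selects among them using a scoring rule weighted by the balance coefficient σ, which is omitted here.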
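The reported hyper-parameters can be collected into a single configuration sketch for reference. The field names below are hypothetical (no codebase was released); only the numeric values come from the paper. A rough parameter count of the backbone (about 12·h² weights per layer) serves as a sanity check against the reported 307M.

```python
# Hypothetical configuration mirroring the paper's reported hyper-parameters.
UFC_BERT_CONFIG = {
    # BERT backbone (~307M parameters, same size as VQGAN's Transformer)
    "num_layers": 24,
    "hidden_size": 1024,
    "num_attention_heads": 16,
    # PNAG decoding hyper-parameters
    "parallel_decoding_B": 5,      # number of sequences decoded in parallel
    "balance_coefficient_sigma": 0.5,
    "initial_mask_ratio_alpha": 0.8,
    "min_mask_ratio_beta": 0.2,
    "max_iterations_T": 10,
}

# Rough backbone size: ~12 * hidden^2 weights per Transformer layer.
approx_params = (UFC_BERT_CONFIG["num_layers"]
                 * 12 * UFC_BERT_CONFIG["hidden_size"] ** 2)
print(f"~{approx_params / 1e6:.0f}M")  # ~302M; close to the reported 307M
```

The ~302M estimate excludes embedding and output-head weights, which plausibly account for the gap to the reported 307M total.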