UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis
Authors: Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, Hongxia Yang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on a newly collected large-scale clothing dataset M2C-Fashion and a facial dataset Multi-Modal CelebA-HQ verify that UFC-BERT can synthesize high-fidelity images that comply with flexible multi-modal controls. |
| Researcher Affiliation | Collaboration | Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, and Hongxia Yang. DAMO Academy, Alibaba Group; Tsinghua University. {zhangzhu950310}@gmail.com, {jason.mjx, ericzhou.zc, yang.yhx}@alibaba-inc.com |
| Pseudocode | No | The paper describes algorithms such as Mask-Predict and Progressive Non-Autoregressive Generation (PNAG) in detail, but does not present them as structured pseudocode or algorithm blocks (see the hedged sketches after this table). |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We additionally use another high-resolution facial dataset Multi-Modal CelebA-HQ [28, 61]. |
| Dataset Splits | No | The paper mentions using two datasets (M2C-Fashion and Multi-Modal CelebA-HQ) but does not provide specific details on how these datasets were split into training, validation, and test sets for reproducibility. |
| Hardware Specification | Yes | We evaluate speed on the same V100 GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., Python, PyTorch, TensorFlow versions) used for its implementation or experiments. |
| Experiment Setup | Yes | For the BERT model, we set the number of layers, hidden size, and the number of attention heads to 24, 1024, and 16, respectively. Our UFC-BERT has 307M parameters, the same as the Transformer used by VQGAN. As for the hyper-parameters of PNAG, we set the parallel decoding number B to 5 and the balance coefficient σ to 0.5. We set the initial mask ratio α, the minimum mask ratio β, and the maximum iteration number T to 0.8, 0.2, and 10, respectively. (These values are reused in the sketches below.) |
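
As noted in the Pseudocode row, the paper describes its decoding procedure only in prose. The sketch below reconstructs a Mask-Predict style iterative decoding loop from that description, plugging in the hyper-parameters reported under Experiment Setup (α = 0.8, β = 0.2, T = 10). The `model(tokens, controls)` interface, the `mask_id` placeholder, and the linear decay of the re-mask ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def mask_predict(model, controls, seq_len, alpha=0.8, beta=0.2, T=10, mask_id=0):
    """Refine a fully masked sequence of discrete image tokens over T iterations.

    Assumes `model(tokens, controls)` returns per-position logits over the
    VQ codebook; all names here are hypothetical, not the authors' code.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for t in range(T):
        probs = model(tokens, controls).softmax(dim=-1)   # (seq_len, vocab)
        pred = torch.multinomial(probs, 1).squeeze(-1)    # sample for diversity
        conf = probs.gather(-1, pred.unsqueeze(-1)).squeeze(-1)
        tokens = pred
        # Linearly decay the re-mask ratio from alpha (0.8) down to beta (0.2).
        ratio = alpha - (alpha - beta) * t / max(T - 1, 1)
        n_mask = int(ratio * seq_len)
        if t == T - 1 or n_mask == 0:
            break
        # Re-mask the least confident positions and refine them next iteration.
        tokens[conf.topk(n_mask, largest=False).indices] = mask_id
    return tokens
```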
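
PNAG additionally decodes several candidate sequences in parallel (B = 5) and keeps the best one under a score balanced by σ = 0.5. The sketch below assumes the caller supplies the two scoring callables; the paper's actual candidate-ranking criterion is described only in prose, so `fidelity_fn` and `relevance_fn` are hypothetical placeholders for it.

```python
def pnag_decode(model, controls, seq_len, fidelity_fn, relevance_fn,
                B=5, sigma=0.5, **mask_predict_kwargs):
    """Decode B candidates with mask_predict and return the highest-scoring one."""
    candidates = [mask_predict(model, controls, seq_len, **mask_predict_kwargs)
                  for _ in range(B)]
    # sigma balances the two (assumed) terms; both callables map tokens -> float.
    def score(tokens):
        return sigma * fidelity_fn(tokens) + (1 - sigma) * relevance_fn(tokens, controls)
    return max(candidates, key=score)
```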