AltDiffusion: A Multilingual Text-to-Image Diffusion Model
Authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. |
| Researcher Affiliation | Academia | Fulong Ye1,2*, Guang Liu2*, Xinya Wu2, Ledell Wu2. 1 Beijing University of Posts and Telecommunications, Beijing, China; 2 Beijing Academy of Artificial Intelligence. fulongye@bupt.edu.cn, {liuguang, yxwu, wuyu}@baai.ac.cn |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. The paper describes the methods in narrative text and equations. |
| Open Source Code | Yes | All source code and checkpoints can be found at https://github.com/superhero-7/AltDiffuson. |
| Open Datasets | Yes | All the image-text pairs we use to train AD come from LAION (Schuhmann et al. 2022). LAION 5B includes three sub-datasets: LAION2B-en, LAION2B-multi and LAION1B-nolang... LAION Aesthetics contains several collections of subsets from LAION 5B with high quality. An Aesthetics Predictor is trained using LAION to predict the aesthetics score of images on a scale of 1 to 10, with a higher aesthetics score being better. The Aesthetics Predictor is then used to filter the data. To conduct the second training stage, we filter eighteen languages from the LAION Aesthetics and the LAION Aesthetics V1-multi dataset with a predicted aesthetics score higher than seven. |
| Dataset Splits | No | The paper describes datasets used for training (LAION 5B, LAION Aesthetics) and for evaluation (MG-18, MC-18), but does not provide explicit details about train/validation/test splits of the primary training datasets (LAION) themselves for reproducing the training process. |
| Hardware Specification | Yes | The learning rate is 1e-4, with 10,000 warmup steps on 64 NVIDIA A100-SXM4-40GB GPUs. |
| Software Dependencies | No | The paper mentions specific models (e.g., 'SD v2.1 512-base-ema checkpoint', 'OpenCLIP ViT-H-14', 'XLM-R', 'NLLB-3B') and optimization techniques/libraries ('AdamW', 'Xformer', 'Efficient Attention'), but it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other common libraries. |
| Experiment Setup | Yes | The optimizer is AdamW (Loshchilov and Hutter 2017). The learning rate is 1e-4, with 10,000 warmup steps on 64 NVIDIA A100-SXM4-40GB GPUs... In the concept align stage, we use the SD v2.1 512-base-ema checkpoint to initialize all parameters except the text encoder, with a batch size of 3,072 and a resolution of 256x256... In the quality improvement stage... with a batch size of 3,840 on LAION Aesthetics V1-en and V1-multi and 270,000 steps with a resolution of 512x512... The decay of EMA is 0.9999. |
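The second-stage data preparation described above (keep only pairs from the eighteen target languages whose predicted aesthetics score exceeds seven) can be sketched as a simple filter. This is an illustrative reconstruction, not the paper's code: the record field names (`aesthetics_score`, `language`) and the language set are assumptions.

```python
# Hypothetical sketch of the LAION Aesthetics filtering step: keep image-text
# pairs with predicted aesthetics score > 7 whose language tag is one of the
# eighteen target languages. Field names and the language set are placeholders,
# not taken from the paper or the LAION release.

TARGET_LANGUAGES = {"en", "zh", "es", "fr", "ru", "ja", "ko", "ar", "de",
                    "it", "pt", "nl", "pl", "th", "vi", "hi", "uk", "tr"}

def filter_pairs(records, score_threshold=7.0):
    """Yield records passing both the aesthetics and the language filter."""
    for rec in records:
        if (rec["aesthetics_score"] > score_threshold
                and rec["language"] in TARGET_LANGUAGES):
            yield rec

# usage: only the first record passes both filters
sample = [
    {"url": "a.jpg", "aesthetics_score": 7.5, "language": "zh"},
    {"url": "b.jpg", "aesthetics_score": 6.2, "language": "en"},
    {"url": "c.jpg", "aesthetics_score": 8.1, "language": "sw"},
]
kept = list(filter_pairs(sample))
```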
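The reported optimization hyperparameters (AdamW at a peak learning rate of 1e-4 after 10,000 warmup steps, EMA decay 0.9999) can be illustrated with a minimal pure-Python sketch. The linear warmup shape and the post-warmup constant schedule are assumptions; the paper only states the peak rate, warmup length, and EMA decay, and a real run would use `torch.optim.AdamW` with a scheduler.

```python
# Minimal sketch of the reported optimization schedule, assuming linear warmup
# to the peak learning rate followed by a constant rate, plus an exponential
# moving average (EMA) of the weights with the stated decay.

PEAK_LR = 1e-4        # peak learning rate from the paper
WARMUP_STEPS = 10_000 # warmup steps from the paper
EMA_DECAY = 0.9999    # EMA decay from the paper

def lr_at(step):
    """Learning rate at a given step (warmup shape is an assumption)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR

def ema_update(ema_params, params, decay=EMA_DECAY):
    """In-place EMA of parameters: ema = decay * ema + (1 - decay) * param."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value

# usage: halfway through warmup the rate is half the peak
half_warmup_lr = lr_at(5_000)   # 5e-5
post_warmup_lr = lr_at(20_000)  # 1e-4
```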