Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Align Your Flow: Scaling Continuous-Time Flow Map Distillation

Authors: Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis. We validate AYF on popular image generation benchmarks and achieve state-of-the-art performance among few-step generators on both ImageNet 64x64 and 512x512, while using only small and efficient neural networks (Fig. 4). We train AYF flow maps on ImageNet [10] at resolutions 64x64 and 512x512, measuring sample quality using Fréchet Inception Distance (FID) [23], as previous works. We also use our AYF framework to distill FLUX.1 [dev] [41], the best text-to-image diffusion model, using an efficient LoRA [27] framework and reduce sampling steps to just 4. Experiment details explained in the Appendix.
Researcher Affiliation | Collaboration | Amirmojtaba Sabour (1,2,3), Sanja Fidler (1,2,3), Karsten Kreis (1). Affiliations: (1) NVIDIA, (2) University of Toronto, (3) Vector Institute.
Pseudocode | Yes | Algorithm 1: Flow Map Distillation with AYF-EMD Loss. Algorithm 2: Adversarial Flow Map Finetuning with AYF-EMD and Adversarial losses.
Open Source Code | No | We plan to publicly release our code upon publication. Together with the implementation details given in the paper, our results can be reproduced. We will only be publicly releasing our small-scale ImageNet codebase which does not pose any safety risks.
Open Datasets | Yes | We train AYF flow maps on ImageNet [10] at resolutions 64x64 and 512x512, measuring sample quality using Fréchet Inception Distance (FID) [23], as previous works. We train our model using the text-to-image-2M dataset [30] from Hugging Face, which contains over 2 million real and synthetic images. ImageNet Dataset: Used for our main experiments. Distributed under a non-commercial research license. Text-to-Image-2M Dataset (https://huggingface.co/datasets/jackyhate/text-to-image-2M): Used to train our distilled text-to-image LoRAs. Licensed under the MIT License.
Dataset Splits | No | The paper does not explicitly state training/test/validation dataset splits for the ImageNet or text-to-image-2M datasets used in the main experiments. It describes timestep sampling for training and a 'holdout set of 200 prompts' for a user study, but not formal dataset splits for model evaluation.
Hardware Specification | Yes | These experiments were performed using 32 NVIDIA A100 GPUs and took approximately 24-48 hours to converge. Finetuning is run for approximately 3000 iterations using 32 NVIDIA A100 GPUs, taking around 4 hours in total. This distillation process took approx. four hours on 8 NVIDIA A100 GPUs, which is highly efficient, in contrast to several previous large-scale text-to-image distillation methods.
Software Dependencies | No | The paper mentions the use of 'PyTorch' generally and lists several codebases/libraries (EDM2, StyleGAN3, StyleGAN2, Diffusers) in the 'Licenses' section, but does not specify version numbers for any of these key software components, nor for the programming language (e.g., Python version) or CUDA.
Experiment Setup | Yes | For our ImageNet experiments, we use publicly available checkpoints from EDM2 [36]. These models are first fine-tuned to align with the flow matching framework (see Sec. 3.4 for details) before being used as teacher models to distill a flow map. We run this finetuning stage for 10,000 steps using a learning rate of 0.001. In all experiments, we apply tangent normalization and tangent warmup, following the approach introduced in sCM [49], setting c = 0.1 and H = 10000. We use a learning rate of 10^-4 and a batch size of 2048 for all experiments for a total of 50,000 iterations. For adversarial finetuning, we use the StyleGAN2 discriminator [33] and follow the relativistic pairing GAN (RpGAN) formulation [28, 31]. We use a learning rate of 2×10^-5 for both networks and a batch size of 1024. Finetuning is run for approximately 3000 iterations.
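
The Experiment Setup row quotes two constants, c = 0.1 and H = 10000, for tangent normalization and tangent warmup, without spelling out the formulas. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes normalization rescales a tangent vector g to g / (||g|| + c), and that warmup is a coefficient ramping linearly from 0 to 1 over the first H iterations. The names `DistillConfig`, `warmup_coefficient`, and `normalize_tangent` are hypothetical, and how the warmup coefficient enters the training objective is not described on this page, so it is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    learning_rate: float = 1e-4       # "learning rate of 10^-4"
    batch_size: int = 2048            # "batch size of 2048"
    total_iterations: int = 50_000    # "a total of 50,000 iterations"
    tangent_norm_c: float = 0.1       # "c = 0.1"
    tangent_warmup_iters: int = 10_000  # "H = 10000"


def warmup_coefficient(step: int, warmup_iters: int) -> float:
    """Assumed form: linear ramp from 0 to 1 over `warmup_iters` steps, then constant 1."""
    return min(1.0, step / warmup_iters)


def normalize_tangent(tangent: list[float], c: float) -> list[float]:
    """Assumed form: rescale g to g / (||g|| + c), which keeps the output norm below 1."""
    norm = math.sqrt(sum(x * x for x in tangent))
    return [x / (norm + c) for x in tangent]


if __name__ == "__main__":
    cfg = DistillConfig()
    # Halfway through the warmup window the coefficient is 0.5.
    print(warmup_coefficient(5_000, cfg.tangent_warmup_iters))
    # A toy 2-D tangent with norm 5 is rescaled by 1 / (5 + c).
    print(normalize_tangent([3.0, 4.0], cfg.tangent_norm_c))
```

Under these assumptions the constant c bounds the rescaling factor when the tangent norm is close to zero, and the normalized tangent always has norm strictly less than 1.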