Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Edit Flows: Variable Length Discrete Flow Matching with Sequence-Level Edit Operations

Authors: Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation. Empirically, Edit Flows show a strong and consistent improvement over fixed-length discrete flow and diffusion models (Campbell et al., 2024b; Gat et al., 2024; Shi et al., 2024) across several benchmarks, including image-to-text generation at 280M parameter scale (MS-COCO, Image Captioning 3M), code generation at 1.3B parameter scale (HumanEval, MBPP), and open-ended text benchmarks at 1.3B parameter scale (HellaSwag, ARC, PIQA, OBQA, WinoGrande).
Researcher Affiliation	Industry	Marton Havasi (FAIR at Meta), Brian Karrer (FAIR at Meta), Itai Gat (FAIR at Meta), Ricky T. Q. Chen (FAIR at Meta)
Pseudocode	Yes	Figure 13: Simplified training code for Edit Flows. The helper functions get_z and get_z_t generate noisy and target token sequences, while the training loop computes the loss and updates the model parameters. For brevity, we did not include features such as batching, conditioning on a random portion of the sequence and scaling the model outputs by the rate.
Open Source Code	No	Regarding the source-code, we are not able to publish it at this time due to our organization's policy. We hope to overcome the administrative challenges and publish our code in the future.
Open Datasets	Yes	Specifically, we train from scratch on the MS COCO dataset (Lin et al. 2014; CC-BY 4.0) and an image captioning dataset containing 3M image-caption pairs. For text benchmarks, we trained our models using the DCLM baseline 1.0 (Li et al. 2024; CC-BY 4.0) dataset. For the code generation benchmarks, we used the Code Llama datamix (Roziere et al., 2023).
Dataset Splits	No	The paper mentions training on the MS COCO dataset, Image Captioning 3M dataset, DCLM baseline 1.0, and Code Llama datamix. While these are recognized datasets, the paper does not specify the train/validation/test splits used for these datasets (e.g., percentages, sample counts, or explicit references to standard splits).
Hardware Specification	Yes	All models were trained for 500,000 steps with batch size of 4096 distributed across 16 × 8 H100 GPUs
Software Dependencies	No	The paper mentions using the Llama architecture and FlexAttention, but does not provide specific version numbers for these or other software libraries/dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	The maximum sequence length during training is set to 1024 tokens for all models. All models were trained for 500,000 steps with batch size of 4096 distributed across 16 × 8 H100 GPUs... Table 4 (two columns with identical values): optimizer AdamW; learning rate 3e-4; beta 1 0.9; beta 2 0.95; warmup steps 2000; learning rate schedule cosine. A beginning of each sequence in the training set is designated to be conditioning. The portion of the sequence used as conditioning is randomly chosen to be c^3 where c ~ U[0, 1]. For 10% of the sequences, we drop the conditioning... Table 5 shows the sampling parameters used for evaluation in the code benchmarks.
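To make the quoted Experiment Setup values concrete, the following is a minimal Python sketch of a warmup-plus-cosine learning rate schedule and of the conditioning-length sampling. It is an illustration, not the authors' code (which is unpublished). The function names, the linear shape of the warmup, the decay floor of zero, and the reading of the garbled "c3" as c cubed are all assumptions; only the numeric constants come from the table above.

```python
import math
import random

# Constants quoted in the Experiment Setup row.
PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 500_000
MAX_SEQ_LEN = 1024
COND_DROP_PROB = 0.10  # "For 10% of the sequences, we drop the conditioning"


def learning_rate(step, min_lr=0.0):
    """Hypothetical schedule: linear warmup, then cosine decay to min_lr.

    The source only says "warmup steps 2000" and "cosine"; the linear
    warmup and min_lr=0.0 are assumptions made for this sketch.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


def conditioning_length(seq_len, rng):
    """Hypothetical helper: number of leading tokens used as conditioning.

    Assumes the conditioning fraction is c**3 with c ~ U[0, 1), and that
    dropped conditioning means a prefix of length zero.
    """
    if rng.random() < COND_DROP_PROB:
        return 0
    c = rng.random()
    return int(c ** 3 * seq_len)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, WARMUP_STEPS - 1, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, learning_rate(step))
    print([conditioning_length(MAX_SEQ_LEN, rng) for _ in range(5)])
```

Under the cubed reading, short conditioning prefixes are far more likely than long ones: a uniform c below 0.5 yields a prefix shorter than one eighth of the sequence.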