Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Edit Flows: Variable Length Discrete Flow Matching with Sequence-Level Edit Operations
Authors: Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation. Empirically, Edit Flows show a strong and consistent improvement over fixed-length discrete flow and diffusion models (Campbell et al., 2024b; Gat et al., 2024; Shi et al., 2024) across several benchmarks, including image-to-text generation at 280M parameter scale (MS-COCO, Image Captioning 3M), code generation at 1.3B parameter scale (HumanEval, MBPP), and open-ended text benchmarks at 1.3B parameter scale (HellaSwag, ARC, PIQA, OBQA, WinoGrande). |
| Researcher Affiliation | Industry | Marton Havasi (FAIR at Meta); Brian Karrer (FAIR at Meta); Itai Gat (FAIR at Meta); Ricky T. Q. Chen (FAIR at Meta) |
| Pseudocode | Yes | Figure 13: Simplified training code for Edit Flows. The helper functions get_z and get_z_t generate noisy and target token sequences, while the training loop computes the loss and updates the model parameters. For brevity, we did not include features such as batching, conditioning on a random portion of the sequence and scaling the model outputs by the rate. |
| Open Source Code | No | Regarding the source-code, we are not able to publish it at this time due to our organization's policy. We hope to overcome the administrative challenges and publish our code in the future. |
| Open Datasets | Yes | Specifically, we train from scratch on the MS COCO dataset (Lin et al. 2014; CC-BY 4.0) and an image captioning dataset containing 3M image-caption pairs. For text benchmarks, we trained our models using the DCLM baseline 1.0 (Li et al. 2024; CC-BY 4.0) dataset. For the code generation benchmarks, we used the Code Llama datamix (Roziere et al., 2023). |
| Dataset Splits | No | The paper mentions training on the MS COCO dataset, Image Captioning 3M dataset, DCLM baseline 1.0, and Code Llama datamix. While these are recognized datasets, the paper does not specify the train/validation/test splits used for these datasets (e.g., percentages, sample counts, or explicit references to standard splits). |
| Hardware Specification | Yes | All models were trained for 500,000 steps with batch size of 4096 distributed across 16 × 8 H100 GPUs |
| Software Dependencies | No | The paper mentions using the Llama architecture and FlexAttention, but does not provide specific version numbers for these or other software libraries/dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The maximum sequence length during training is set to 1024 tokens for all models. All models were trained for 500,000 steps with batch size of 4096 distributed across 16 × 8 H100 GPUs... Table 4 (values identical in both columns): optimizer AdamW; learning rate 3e-4; beta 1 0.9; beta 2 0.95; warmup steps 2000; learning rate schedule cosine. A beginning of each sequence in the training set is designated to be conditioning. The portion of the sequence used as conditioning is randomly chosen to be c^3 where c ~ U[0, 1]. For 10% of the sequences, we drop the conditioning... Table 5 shows the sampling parameters used for evaluation in the code benchmarks. |
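
The hyperparameters quoted in the Experiment Setup row can be collected into a small sketch. This is an illustration only, not the authors' code (which is unpublished): the function names are hypothetical, the warmup is assumed to be linear from zero, the cosine is assumed to decay to zero at the final step, and the conditioning fraction is read as c cubed with c drawn uniformly from [0, 1].

```python
import math
import random

# Values quoted in the Experiment Setup row (Table 4 of the paper).
PEAK_LR = 3e-4
ADAM_BETAS = (0.9, 0.95)
WARMUP_STEPS = 2_000
TOTAL_STEPS = 500_000
BATCH_SIZE = 4096
MAX_SEQ_LEN = 1024
# 128 GPUs, reading the row's GPU count as 16 x 8 (assumption).
NUM_GPUS = 16 * 8


def learning_rate(step: int) -> float:
    """Cosine schedule with warmup.

    Assumptions (not stated in the row): warmup is linear from 0,
    and the cosine decays to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def conditioning_length(seq_len: int, rng: random.Random) -> int:
    """Number of leading tokens treated as conditioning.

    Reads the conditioning fraction as c**3 with c ~ U[0, 1]
    (an assumption about the row's notation) and drops the
    conditioning entirely for 10% of sequences.
    """
    if rng.random() < 0.1:
        return 0
    c = rng.random()
    return int(c ** 3 * seq_len)
```

Under these assumptions each GPU would process `BATCH_SIZE // NUM_GPUS` sequences per step, and cubing a uniform sample skews the conditioning prefix toward short lengths.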