Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Jiaze Chen, Xuefeng Li, Qiying Yu, Hao Zhou, Mingxuan Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We introduce ENIGMATA, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across 7 categories, each with: 1) a generator that produces unlimited examples with controllable difficulty, and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose ENIGMATA-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-ENIGMATA, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like ENIGMATA-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off.
Researcher Affiliation	Collaboration	Jiangjie Chen (1,5), Qianyu He (1,2), Siyu Yuan (1,2), Aili Chen (2), Zhicheng Cai (3,5), Weinan Dai (1,3,5), Hongli Yu (1,3,5), Qiying Yu (1,3,5), Xuefeng Li (1,4), Jiaze Chen (1,5), Hao Zhou (3,5), Mingxuan Wang (1,5). Affiliations: (1) ByteDance Seed; (2) Fudan University; (3) Institute for AI Industry Research (AIR), Tsinghua University; (4) Shanghai Jiao Tong University; (5) SIA-Lab of Tsinghua AIR and ByteDance Seed.
Pseudocode	No	The paper describes methods such as 'Data Construction', 'Rejection Fine-tuning', and 'RL with Verifiable Puzzles' in detail using prose. It also outlines rules for various puzzles in Appendix F.3. However, it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code	No	Project page: https://seed-enigmata.github.io. Open access to data and code: We will release the resources upon paper’s publication.
Open Datasets	Yes	The whole ENIGMATA suite comprises of ENIGMATA-Data, ENIGMATA-Eval, and ENIGMATA-Model. Based on ENIGMATA-Data, we present the ENIGMATA-Eval benchmark, a diverse collection of puzzles... We also include the training data of the ARC-AGI puzzle [9; 27] in the RFT data... The public training set of ARC-AGI 1 and 2... AIME mathematical problems (1983-2023)... Project page: https://seed-enigmata.github.io.
Dataset Splits	Yes	We develop ENIGMATA-Eval by systematically sampling from our broader dataset. For each task, we aimed to extract 50 instances per difficulty level (Easy, Medium, Hard). However, due to inherent constraints in some tasks, we collected a total of 4,758 puzzle instances rather than the theoretical maximum of 5,400. This discrepancy arises because some tasks generate fewer than 50 instances per difficulty level, while others rely on manually collected and annotated data rather than auto-generation. Importantly, we ensured no data leakage between training and evaluation sets by implementing strict separation protocols during the sampling process. Appendix B: Rejection Fine-tuning. For the puzzle part of the Rejection Fine-tuning (RFT) dataset, we sample 1,000 instances from each task in the ENIGMATA dataset. We also include synthetic ARC-AGI data. ... The final puzzle dataset contains 12,041 high-quality puzzle samples. For the mathematical part of the RFT dataset, we collected mathematical problems from light-R1 [28] ... resulting in a total of 12,533 mathematical samples. ... As for the Mix-Training, the dataset for this approach consists of three components: (1) ENIGMATA-Train: 400 samples per task with equal distribution across difficulty levels; (2) ARC-AGI 1 and 2: official datasets upsampled 8x to address specific reasoning challenges; (3) AIME problems from 1983-2023 upsampled 2x as the mathematical component. We maintained a 1:1 puzzle-to-math ratio throughout training to ensure balanced exposure to different reasoning types. Table 9: Training data distribution across different strategies and stages.
Hardware Specification	No	The paper does not explicitly mention specific hardware details such as GPU models, CPU models, or specific cloud instances used for running the experiments. While it refers to 'prohibitive resources required to train a 20B/200B model' in Section 5.3, it does not specify the nature of these resources.
Software Dependencies	No	Appendix C states: "We adopt a variant of Proximal Policy Optimization (PPO) [30], i.e., VC-PPO [29], to train our reasoning agent on verifiable puzzles." and "Our implementation is based on the VeRL framework." It also mentions "We leverage vLLM [37] for efficient batched decoding". While these are specific tools and algorithms, the paper does not provide version numbers for general software components (like Python, PyTorch, CUDA) or for the mentioned frameworks (VeRL, vLLM).
Experiment Setup	Yes	Appendix C: We fine-tune the model on this balanced dataset for 2 epochs using a maximum sequence length of 32768 tokens and a learning rate of 1e-5. ... PPO training is conducted for 425 steps with a batch size of 4,096 and a mini-batch size of 512. The actor and critic are optimized using Adam, with learning rates of 1e-6 and 2e-6, respectively, and a linear warm-up schedule over 10 steps. Before PPO begins, we perform value pretraining [29] for 15 steps... Rollouts are generated using temperature sampling (τ = 1.0), with enforced end-of-sequence tokens.
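
The figures quoted in the Dataset Splits and Experiment Setup rows can be collected in one place and sanity-checked. The following is a minimal Python sketch, not the authors' code: the class, field, and function names are hypothetical and do not correspond to VeRL or vLLM options. Only the numeric values come from the rows above. The even split of a batch into mini-batches is an assumption, since the paper excerpt does not say how mini-batches are formed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RFTConfig:
    """Rejection fine-tuning settings quoted from Appendix C (names are hypothetical)."""
    epochs: int = 2
    max_seq_len: int = 32768
    learning_rate: float = 1e-5


@dataclass(frozen=True)
class PPOConfig:
    """PPO settings quoted from Appendix C (names are hypothetical)."""
    steps: int = 425
    batch_size: int = 4096
    mini_batch_size: int = 512
    actor_lr: float = 1e-6
    critic_lr: float = 2e-6
    warmup_steps: int = 10          # linear warm-up
    value_pretrain_steps: int = 15  # run before PPO begins
    temperature: float = 1.0        # rollout sampling temperature

    def mini_batches_per_step(self) -> int:
        # Assumption: each batch is split evenly into mini-batches.
        return self.batch_size // self.mini_batch_size


def eval_size_gap(tasks: int = 36, levels: int = 3,
                  per_level: int = 50, collected: int = 4758) -> tuple[int, int]:
    """Return (theoretical maximum, shortfall) for ENIGMATA-Eval."""
    theoretical = tasks * levels * per_level
    return theoretical, theoretical - collected


if __name__ == "__main__":
    ppo = PPOConfig()
    print(ppo.mini_batches_per_step())  # 8
    print(eval_size_gap())              # (5400, 642)
```

Under these assumptions, the 36 tasks at 3 difficulty levels and 50 instances each reproduce the stated theoretical maximum of 5,400, which leaves ENIGMATA-Eval 642 instances short of that maximum.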