Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Balanced Conic Rectified Flow

Authors: Kim Shin seong, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conducted experiments to evaluate the effectiveness of our method. Our findings demonstrate: Superiority over original reflow in terms of (1) Quality of the results, (2) Straightness of the flow, (3) Mitigation of distribution shift, as well as (4) Ablation study, (5) Generalization to other datasets. Our method achieves better FID and IS scores across all sampling steps, i.e., 1-step, few-step, and full-step generations, as shown in Table 1 and Figure 6.
Researcher Affiliation	Academia	Shin seong Kim Yonsei University EMAIL Mingi Kwon Yonsei University EMAIL Jaeseok Jeong Yonsei University EMAIL Youngjung Uh Yonsei University EMAIL
Pseudocode	Yes	The full pseudocode for our training method is provided in Appendix K.
Open Source Code	No	We provide a partial implementation with core components in Appendix M, and plan to release the full codebase and instructions by the camera-ready deadline.
Open Datasets	Yes	Experimental setup: Most of our experiments are conducted on CIFAR-10 [20]. On ImageNet 64×64 [7], our method consistently improves unconditional generation quality over the original model. In this section, we assess the generalizability of our method on the LSUN Bedroom dataset [52] at a resolution of 256×256.
Dataset Splits	No	The IVD, curvature, reconstruction, and perturbed (0.05ε, 1-step) reconstruction error values reported were computed using 10,000 random samples, with the expectation taken over these samples. While the paper mentions using 10,000 random samples for specific evaluations, it does not explicitly provide the training/test/validation splits for the main datasets (CIFAR-10, ImageNet, LSUN Bedroom) in the provided main text. It mentions data used in training (e.g., '300K fake pairs' for CIFAR-10) but not the initial dataset splits.
Hardware Specification	No	We provide the details of our CPU and GPU resources in the appendix A. Despite GPU limitations on the larger LSUN dataset, our method consistently outperformed the original in image quality. The main text refers to GPU resources being detailed in Appendix A, which is not provided. It mentions 'GPU limitations' but does not specify any exact GPU models or other hardware details in the main body of the paper.
Software Dependencies	No	We employ SciPy's RK45 [46], a 5(4) Runge-Kutta method with adaptive step size and step count determined by specified tolerances, following the same parameters [37]. The paper mentions 'SciPy's RK45' but does not provide a specific version number for SciPy, which is required for a reproducible description of software dependencies.
Experiment Setup	Yes	For CIFAR-10 and ImageNet, we set $\zeta_{\max}$ to 0.13 and 0.23, respectively. Each conic is trained to progressively reduce the noise scale over time. Specifically, the Slerp noise schedule $\zeta(t)$ is defined as $\zeta(t) := \zeta_{\max} \frac{2t^2}{1+t^2}$, $t \in [0, 1]$, where $t = 1$ corresponds to the start of training and $t = 0$ to the end. All experiments are conducted with a batch size of 256, and training is performed for 300K iterations. Each setting uses 300K fake pairs and 60K real pairs for training on CIFAR-10. We use the same hyperparameters, time schedule, and EMA settings as in the experiments by Liu et al. [27].
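
For readers who want to see what the RK45 solver named in the Software Dependencies row looks like in practice, the following is a minimal sketch using SciPy's `solve_ivp`. It is not the authors' sampling code: the velocity field is a toy constant-velocity (perfectly straight) flow, the endpoints `z0` and `z1` are random placeholders, and the tolerances of 1e-5 are assumed values, since the excerpt only says the parameters follow [37].

```python
import numpy as np
from scipy.integrate import solve_ivp


def velocity(t, z, z0, z1):
    # Toy stand-in for a learned velocity network (assumption):
    # a perfectly straight flow has constant velocity z1 - z0.
    return z1 - z0


rng = np.random.default_rng(0)
z0 = rng.standard_normal(4)  # placeholder "noise" endpoint
z1 = rng.standard_normal(4)  # placeholder "data" endpoint

# Adaptive 5(4) Runge-Kutta; the step count is chosen by the solver
# from rtol/atol (assumed values, not taken from the paper).
sol = solve_ivp(
    velocity,
    (0.0, 1.0),
    z0,
    method="RK45",
    rtol=1e-5,
    atol=1e-5,
    args=(z0, z1),
)

z_end = sol.y[:, -1]  # state at t = 1
n_evals = sol.nfev    # number of velocity evaluations the solver used
```

Recording `scipy.__version__` alongside such a run would address the missing version information noted in that row.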
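
The noise schedule quoted in the Experiment Setup row can be sketched as a small function. This assumes the garbled formula in the extracted text reads ζ(t) = ζmax · 2t² / (1 + t²); the function name `zeta` and the dictionary `ZETA_MAX` are hypothetical and do not come from the paper.

```python
def zeta(t: float, zeta_max: float) -> float:
    """Noise scale zeta(t) = zeta_max * 2 t^2 / (1 + t^2) for t in [0, 1].

    Assumed reading of the schedule; t = 1 is the start of training
    and t = 0 is the end, so the scale shrinks as training proceeds.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return zeta_max * 2.0 * t * t / (1.0 + t * t)


# zeta_max values quoted in the table (dictionary keys are hypothetical).
ZETA_MAX = {"cifar10": 0.13, "imagenet": 0.23}

# Evaluate the schedule from the start of training (t = 1) to the end (t = 0).
schedule = [zeta(t, ZETA_MAX["cifar10"]) for t in (1.0, 0.75, 0.5, 0.25, 0.0)]
```

Under this reading the schedule starts at ζmax when t = 1 and falls to zero at t = 0, which matches the statement that each conic is trained to progressively reduce the noise scale.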