Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

BundleFlow: Deep Menus for Combinatorial Auctions by Diffusion-Based Optimization

Authors: Tonghan Wang, Yanchen Jiang, David C. Parkes

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate BUNDLEFLOW using single-bidder instantiations of CATS [35], a widely adopted CA testbed. Experimental results demonstrate that our method consistently and significantly outperforms all baselines across all benchmark settings and scales to auctions involving up to 500 items. For auctions with 50 to 150 items, BUNDLEFLOW achieves 1.11–2.23× higher revenue.
Researcher Affiliation	Academia	Tonghan Wang Harvard University EMAIL Yanchen Jiang Harvard University EMAIL David C. Parkes Harvard University EMAIL
Pseudocode	No	The paper describes the methodology using mathematical equations and text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Our code is available online.² ²https://github.com/TonghanWang/BundleFlow.git
Open Datasets	Yes	We evaluate our method on CATS [35], a standard benchmark in CA research.¹ Consistent with previous works [5, 6, 17, 19, 26–28, 47, 49, 59, 62], we focus our experiments on Arbitrary and Regions environments, which represent the most challenging problem instances [34]. Valuations are expressed in the CATS XOR bidding language as sets of bundles paired with their corresponding values (sets of atoms). We test different numbers of items: 10, 50, 75, 100, and 150 across all environments. On the Regions environment with normal value distributions, we further test 200 and 500 items. When varying the number of items, we set the maximum XOR atoms per bid to 5 (the default value). We also experiment with increasing the maximum number of atoms to 50. Appx. D.1 discusses how to set up single-bidder auctions in CATS. All experiments are conducted on a single NVIDIA A100 GPU. Our code is available online.² ¹We used the latest CATS v2.2 as is distributed under the CATS License Agreement (non-commercial research use); see https://www.cs.ubc.ca/~kevinlb/CATS/.
Dataset Splits	Yes	To obtain single-bidder valuations, we generate 100,000 such files and extract valuation functions identified by a consistent dummy item. Of these, 95% are used for training, with the remaining 5% reserved for testing.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA A100 GPU.
Software Dependencies	No	The paper mentions using the Adam optimizer and the Gumbel-Softmax technique, but it does not specify version numbers for any software libraries or programming languages used for implementation.
Experiment Setup	Yes	We do not extensively fine-tune hyperparameters. This suggests that our formulation of the ODE, including its functional form (Eq. 5) and initial conditions, is well-suited to the needs of the CA setting, making the optimization of the flow model relatively straightforward. Specifically, the Q network comprises three 128-dimensional tanh-activated fully connected layers. When m > 100, we increase the width of the last layer to 256. The σ network is simpler and has two 128-dimensional tanh-activated fully connected layers. Two important hyperparameters are D, the support size of the initial distribution, and K, the menu size. By default, D is set to 8, a relatively small number. K is 5000 when m ≤ 100 and is 20000 otherwise. The same menu size is used for our method, Menu-, Menu+, and RochetNet. Notably, the menu size K is adequate to encompass all possible bundles for smaller numbers of items, such as m = 5 or 10. We show the impact of different values of D and K in ablation studies. Menu optimization for BUNDLEFLOW is conducted using the Adam optimizer with a learning rate of 0.3. λ_SOFTMAX is increased from 0.001 to 0.2 over the course of training. For comprehensive details on our hyperparameter settings, please refer to the codebase. For the baselines, we fine-tuned their hyperparameters so that they perform significantly better than the default RochetNet setting. The modifications are achieved by performing a grid search to obtain the optimum combination of λ_SOFTMAX and learning rate that yields the best revenue and also guarantees convergence. Both Menu- and Menu+ use a learning rate of 0.3 and λ_SOFTMAX of 2, while RochetNet uses a learning rate of 0.05 and λ_SOFTMAX of 20.
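
The Dataset Splits and Experiment Setup rows quote several concrete numbers, which can be gathered into a small configuration sketch. The Python below is illustrative only and is not taken from the authors' codebase: the names `SetupSketch`, `split_counts`, and `lam_softmax` are hypothetical, the split counts assume one valuation per generated file, and the linear ramp for the softmax temperature is an assumption, since the quoted text gives only its start and end values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupSketch:
    """Values quoted in the table above; all field names are hypothetical."""
    num_items: int               # m in the paper's notation
    support_size: int = 8        # D, the reported default
    learning_rate: float = 0.3   # Adam learning rate for menu optimization
    lam_start: float = 0.001     # initial softmax temperature
    lam_end: float = 0.2         # final softmax temperature

    @property
    def menu_size(self) -> int:
        # K is 5000 when m <= 100 and 20000 otherwise.
        return 5000 if self.num_items <= 100 else 20000

    @property
    def q_layer_widths(self) -> tuple:
        # Three 128-wide tanh layers; the last widens to 256 when m > 100.
        last = 256 if self.num_items > 100 else 128
        return (128, 128, last)


def split_counts(total: int = 100_000, train_pct: int = 95) -> tuple:
    """Return (train, test) sizes for an integer-percentage split.

    ASSUMPTION: one valuation per generated file, so `total` counts samples.
    """
    train = total * train_pct // 100
    return train, total - train


def lam_softmax(step: int, total_steps: int, cfg: SetupSketch) -> float:
    """ASSUMPTION: linear ramp; the source reports only the two endpoints."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return cfg.lam_start + frac * (cfg.lam_end - cfg.lam_start)
```

Under these assumptions, `split_counts()` gives 95,000 training and 5,000 test valuations, and `SetupSketch(num_items=150)` reports a menu size of 20000 with a 256-wide final Q layer. The actual schedule and any remaining hyperparameters should be taken from the authors' repository.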