Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Mean-Field RL for Large-Scale Unit-Capacity Pickup-and-Delivery Problems
Authors: Kai Cui, Sharif Azem, Christian Fabian, Kirill Kuroptev, Ramin Khalili, Osama Abboud, Florian Steinke, Heinz Koeppl
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate by comparing against OR-Tools, PyVRP and heuristics, showing potential generalization capabilities and speeds of MFC for different large-scale settings, where existing methods for discrete optimization can be expensive. Overall, our work establishes a novel synthesis of MFC-based RL techniques, vehicle routing problems and clustering approximations, to solve a hard discrete optimization problem of practical use in a scalable way. |
| Researcher Affiliation | Collaboration | ¹Technische Universität Darmstadt, ²Huawei |
| Pseudocode | Yes | Algorithm 1 MFVRP-G. 1: Input: µ₀, ν₀, distance matrix. 2: for iterations n = 1, 2, . . . do 3: for time steps t = 0, . . . , Blen − 1 do 4: Sample MFC action Tₜ ∼ π̂θ(Tₜ \| µₜ, νₜ). 5: Compute reward r(µₜ) and next MFs µₜ₊₁, νₜ₊₁ as in Eqs. (2), (3), (4), and termination / reset flag dₜ₊₁ ∈ {0, 1}. 6: end for 7: Update π̂θ on minibatch sampled from data {((µₜ, νₜ), Tₜ, rₜ, dₜ₊₁, (µₜ₊₁, νₜ₊₁))}ₜ≥₀. 8: end for |
| Open Source Code | Yes | Code can be found under http://github.com/tudkcui/MFVRP |
| Open Datasets | No | In this study, we generate 3 different types of datasets, called uniform, finite and cities. For each type of dataset we vary the number of objects to be delivered, M ∈ {100, 1000, 5000, 50000}, and for the finite and cities distributions we also vary the number of clusters, K ∈ {3, 5}. |
| Dataset Splits | No | We generate 50 instances for each combination of M and K, resulting in 200 VRP instances for uniform, and 400 VRP instances for cities and finite. |
| Hardware Specification | Yes | For our training, we used a GPU with 40 GB VRAM and over 19 TFLOPS of FP32 compute performance for a single hour for each fixed K in MFVRP-G (pretraining), and around two to three minutes per MFVRP-F or MFVRP-FD run for a particular problem instance. |
| Software Dependencies | No | For the implementation, we use JAX with PPO (Schulman et al., 2017) based on PureJaxRL (Lu et al., 2022). |
| Experiment Setup | Yes | We learned policies with two hidden layers of 256 nodes and tanh activations. The discount factor is set to γ = 0.99, with GAE λ = 0.95. The minibatch sizes are 10000 for MFVRP-F, and annealed from 10000 to 250000 for MFVRP-G, with 8 PPO steps per training batch. The clip parameter is set to 0.2. The learning rate was linearly annealed from 0.00003 down to zero. As usual in RL, our implementation uses discounted objectives and stationary time-independent policies (Guo et al., 2022), and we also transform the current time t into a one-hot observation. Unless stated otherwise, we use the parameters K = 5, R = 5, C = 15, c_d = 50, c_miss = 30, r_bonus = 5 and datasets for clustered and uniform scenarios with all locations in X = [−1, 1]², generated as described in the Appendix. |
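The Algorithm 1 pseudocode quoted in the table can be sketched as a plain training loop. This is a minimal illustration only: the callables `sample_action`, `env_step`, and `update_policy` are hypothetical stand-ins for the MFC policy π̂θ, the mean-field transition of Eqs. (2)–(4), and the PPO update, and none of these names come from the authors' released code.

```python
# Hedged sketch of the Algorithm 1 (MFVRP-G) loop structure.
# mu0, nu0 are the initial mean fields; b_len is the episode length Blen.
def train(mu0, nu0, n_iters, b_len, sample_action, env_step, update_policy):
    for n in range(n_iters):
        mu, nu = mu0, nu0
        batch = []
        for t in range(b_len):                 # t = 0, ..., Blen - 1
            T = sample_action(mu, nu)          # T_t ~ pi_theta(. | mu_t, nu_t)
            r, mu_next, nu_next, done = env_step(mu, nu, T)
            # transition tuple ((mu_t, nu_t), T_t, r_t, d_{t+1}, (mu_{t+1}, nu_{t+1}))
            batch.append(((mu, nu), T, r, done, (mu_next, nu_next)))
            mu, nu = mu_next, nu_next
        update_policy(batch)                   # PPO update on sampled minibatch
    return batch
```

With dummy stand-ins for the three callables, the loop collects one transition tuple per time step and hands the batch to the update once per iteration, mirroring steps 2–8 of the quoted pseudocode.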
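Two details from the Experiment Setup row are easy to pin down concretely: the learning rate is linearly annealed from 0.00003 to zero, and the current time t is fed to the stationary policy as a one-hot observation. The helpers below are an illustrative sketch of those two mechanisms, not the authors' implementation; the function names and signatures are assumptions.

```python
def lr_schedule(step, total_steps, lr_init=3e-5):
    # Linear anneal from lr_init (0.00003) down to zero over training,
    # as stated in the Experiment Setup row.
    return lr_init * max(0.0, 1.0 - step / total_steps)

def one_hot_time(t, horizon):
    # Encode the current time step t as a one-hot vector so a stationary,
    # time-independent policy can still condition on time.
    obs = [0.0] * horizon
    obs[t] = 1.0
    return obs
```

For example, halfway through training `lr_schedule` returns half the initial rate, and `one_hot_time(2, 4)` marks only index 2 of a length-4 horizon.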