Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mean-Field RL for Large-Scale Unit-Capacity Pickup-and-Delivery Problems

Authors: Kai Cui, Sharif Azem, Christian Fabian, Kirill Kuroptev, Ramin Khalili, Osama Abboud, Florian Steinke, Heinz Koeppl

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
"We empirically evaluate by comparing against OR-Tools, PyVRP and heuristics, showing potential generalization capabilities and speeds of MFC for different large-scale settings, where existing methods for discrete optimization can be expensive. Overall, our work establishes a novel synthesis of MFC-based RL techniques, vehicle routing problems and clustering approximations, to solve a hard discrete optimization problem of practical use in a scalable way."
Researcher Affiliation: Collaboration
"¹Technische Universität Darmstadt, ²Huawei; EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode: Yes
"Algorithm 1 MFVRP-G
1: Input: µ0, ν0, distance matrix.
2: for iterations n = 1, 2, . . . do
3:   for time steps t = 0, . . . , Blen − 1 do
4:     Sample MFC action Tt ∼ π̂θ(Tt | µt, νt).
5:     Compute reward r(µt) and next MFs µt+1, νt+1 as in Eqs. (2), (3), (4), and termination / reset flag dt+1 ∈ {0, 1}.
6:   end for
7:   Update π̂θ on a minibatch sampled from the data {((µt, νt), Tt, rt, dt+1, (µt+1, νt+1))}t≥0.
8: end for"
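The quoted training loop can be sketched as follows. This is a minimal, self-contained illustration of the Algorithm 1 structure only: `sample_action`, `env_step`, and the mean-field dynamics are hypothetical placeholders, not the paper's Eqs. (2)-(4) or its actual PPO update.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(theta, mu, nu):
    # Placeholder for T_t ~ pi_theta(. | mu_t, nu_t): a softmax over a
    # linear function of the concatenated mean fields.
    logits = theta @ np.concatenate([mu, nu])
    return np.exp(logits) / np.exp(logits).sum()

def env_step(mu, nu, action):
    # Placeholder mean-field transition and reward; NOT Eqs. (2)-(4).
    reward = -np.abs(mu - action).sum()
    mu_next = 0.9 * mu + 0.1 * action
    done = False
    return reward, mu_next, nu, done

def train(num_iterations=3, b_len=5, dim=4):
    theta = rng.normal(size=(dim, 2 * dim))
    mu = np.full(dim, 1.0 / dim)          # line 1: initial mean fields
    nu = np.full(dim, 1.0 / dim)
    buffer = []
    for n in range(num_iterations):       # line 2: outer iterations
        for t in range(b_len):            # line 3: t = 0, ..., B_len - 1
            action = sample_action(theta, mu, nu)                      # line 4
            reward, mu_next, nu_next, done = env_step(mu, nu, action)  # line 5
            buffer.append(((mu, nu), action, reward, done, (mu_next, nu_next)))
            mu, nu = mu_next, nu_next
        # line 7: sample a minibatch from the collected transitions
        idx = rng.choice(len(buffer), size=min(4, len(buffer)), replace=False)
        batch = [buffer[i] for i in idx]
        theta = theta + 0.0 * len(batch)  # a real PPO policy step would go here
    return theta, buffer

theta, buffer = train()
```

The real implementation replaces the placeholder update with the PPO step described under "Experiment Setup" below.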
Open Source Code: Yes
"Code can be found under http://github.com/tudkcui/MFVRP"
Open Datasets: No
"In this study, we generate 3 different types of datasets, called uniform, finite and cities. For each type of dataset we vary the amount of objects to be delivered M ∈ {100, 1000, 5000, 50000}, and for the finite and cities distribution we also vary the amount of clusters, K ∈ {3, 5}."
Dataset Splits: No
"We generate 50 instances for each combination of M and K, resulting in 200 VRP instances for uniform, and 400 VRP instances for cities and finite."
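The quoted dataset setup could be sketched as below. This is an illustrative guess at the generation scheme, not the paper's code: the function name `make_instance`, the cluster spread of 0.1, and drawing cluster centers uniformly are all assumptions; only the types (uniform vs. clustered), the range X = [−1, 1]², and the instance counts come from the quoted text.

```python
import numpy as np

def make_instance(dataset_type, M, K=5, seed=0):
    """Sketch of one VRP instance: M delivery locations in X = [-1, 1]^2.

    'uniform' spreads points uniformly over X; 'finite' and 'cities'
    draw them around K cluster centers (spread 0.1 is an assumption).
    """
    rng = np.random.default_rng(seed)
    if dataset_type == "uniform":
        return rng.uniform(-1.0, 1.0, size=(M, 2))
    centers = rng.uniform(-1.0, 1.0, size=(K, 2))
    assignments = rng.integers(0, K, size=M)
    points = centers[assignments] + 0.1 * rng.normal(size=(M, 2))
    return np.clip(points, -1.0, 1.0)  # keep locations inside X

# 50 instances per (type, M, K) combination, as in the quoted setup
instances = [make_instance("finite", M=100, K=3, seed=s) for s in range(50)]
```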
Hardware Specification: Yes
"For our training, we used a GPU with 40 GB VRAM and over 19 TFLOPS of FP32 compute performance for a single hour for each fixed K in MFVRP-G (pretraining), and around two to three minutes per MFVRP-F or MFVRP-FD run for a particular problem instance."
Software Dependencies: No
"For the implementation, we use JAX with PPO (Schulman et al., 2017) based on PureJaxRL (Lu et al., 2022)."
Experiment Setup: Yes
"We learned policies with two hidden layers of 256 nodes and tanh activations. The discount factor is set to γ = 0.99, with GAE λ = 0.95. The minibatch sizes are 10000 for MFVRP-F, and annealed from 10000 to 250000 for MFVRP-G, with 8 PPO steps per training batch. The clip parameter is set to 0.2. The learning rate was linearly annealed from 0.00003 down to zero. As usual in RL, our implementation uses discounted objectives and stationary time-independent policies (Guo et al., 2022), and we also transform the current time t into a one-hot observation. Unless stated otherwise, we use the parameters K = 5, R = 5, C = 15, cd = 50, cmiss = 30, rbonus = 5 and datasets for clustered and uniform scenarios with all locations in X = [−1, 1]², generated as described in the Appendix."
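For reference, the hyperparameters quoted above can be collected into a single config. The key names (`clip_eps`, `gae_lambda`, etc.) are our own labels, not identifiers from the released code; only the values come from the quoted text.

```python
config = {
    "hidden_layers": (256, 256),   # two hidden layers, tanh activations
    "activation": "tanh",
    "gamma": 0.99,                 # discount factor
    "gae_lambda": 0.95,            # GAE lambda
    "clip_eps": 0.2,               # PPO clip parameter
    "ppo_steps_per_batch": 8,
    "lr_init": 3e-5,               # linearly annealed down to zero
}

def learning_rate(step, total_steps, lr_init=config["lr_init"]):
    """Linear annealing from lr_init down to zero over training."""
    return lr_init * max(0.0, 1.0 - step / total_steps)
```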