Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing

Authors: Arthur Delarue, Ross Anderson, Christian Tjandraatmadja

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach against several baselines on random and standard library instances, achieving an average gap against the OR-Tools routing solver of 1.7% on moderately-sized problems.
Researcher Affiliation | Collaboration | Arthur Delarue (MIT Operations Research Center, Cambridge, MA; adelarue@mit.edu), Ross Anderson (Google Research, Cambridge, MA; rander@google.com), Christian Tjandraatmadja (Google Research, Cambridge, MA; ctjandra@google.com)
Pseudocode | No | The paper describes its algorithms and problem formulations in text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper notes that its implementation is in C++ and uses SCIP and Gurobi, but it provides no statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | In Figure 1, we evaluate our performance on standard library instances from CVRPLIB [27, 38], comparing our results with those of OR-Tools.
Dataset Splits | No | The paper mentions holding out a portion of the data to compute an 'out-of-sample MSE' and limit overfitting during value-function training, but it does not specify explicit, reproducible train/validation/test splits for the main experimental evaluation; it relies on randomly generated or standard library instances without detailing how they are partitioned.
Hardware Specification | No | The paper states that experiments were executed 'in parallel on a large cluster of machines' but gives no specific details on CPU models, GPU models, memory, or other hardware.
Software Dependencies | Yes | Our implementation is in C++ and we use SCIP 6.0.2 as a MIP solver. Average time in seconds to solve a single action selection problem, with the MIP solvers Gurobi 8.1 and SCIP 6.0.2, averaged over 50 random instances.
Experiment Setup | Yes | We model $\hat{V}$ as a small neural network with a fully-connected hidden layer and rectified linear unit (ReLU) activations. We train this neural network to minimize the mean-squared error (MSE) on the cumulative cost data gathered at the policy evaluation step. To limit overfitting, we randomly remove some data (between 10% and 20%) from the training set and use it to evaluate the out-of-sample MSE. Calling $(s^{(i,k_0)}, c^{(i,k_0)})$ the $i$-th data point (out of $N_{k_0}$) from iteration $k_0$, the training objective in iteration $k$ becomes $\sum_{k_0=1}^{k} \sum_{i=1}^{N_{k_0}} \gamma^{k-k_0} \big(\hat{V}(s^{(i,k_0)}) - c^{(i,k_0)}\big)^2$, where $\gamma \in (0, 1]$ denotes the retention factor. In Figure 3, we consider a neural network with 16 hidden ReLU nodes and show the effect of the batch SGD learning rate and the LASSO regularization parameter $\lambda$... (A minimal sketch of this training setup follows the table.)
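
To make the Experiment Setup row concrete, the following is a minimal PyTorch sketch of the retention-weighted value-function training it describes: one fully-connected hidden ReLU layer (16 nodes, as in Figure 3), a random 10-20% hold-out for out-of-sample MSE, batch SGD, and LASSO (L1) regularization. The paper's actual implementation is in C++; every name, signature, and hyperparameter value below is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

# Value-function approximator: one fully-connected hidden layer with
# ReLU activations (the paper's Figure 3 uses 16 hidden nodes).
class ValueNet(nn.Module):
    def __init__(self, state_dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)


def train_value_function(batches, state_dim, gamma=0.9, lam=1e-4,
                         lr=1e-3, epochs=100, holdout_frac=0.15):
    """Retention-weighted MSE training (hypothetical sketch).

    `batches` is a list indexed by iteration k0; each entry is a pair of
    tensors (states, costs) gathered at that policy-evaluation step.
    Data from iteration k0 is weighted by gamma ** (k - k0), where k is
    the current iteration and gamma in (0, 1] is the retention factor.
    Hyperparameter values here are placeholders, not the paper's.
    """
    model = ValueNet(state_dim)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    k = len(batches) - 1  # current iteration index

    # Stack every iteration's data alongside its retention weight.
    states = torch.cat([s for s, _ in batches])
    costs = torch.cat([c for _, c in batches])
    weights = torch.cat([
        torch.full((s.shape[0],), gamma ** (k - k0))
        for k0, (s, _) in enumerate(batches)
    ])

    # Randomly hold out 10-20% of the data to track out-of-sample MSE.
    n = states.shape[0]
    perm = torch.randperm(n)
    n_holdout = int(holdout_frac * n)
    val_idx, tr_idx = perm[:n_holdout], perm[n_holdout:]

    for _ in range(epochs):
        opt.zero_grad()
        pred = model(states[tr_idx])
        # Retention-weighted squared-error objective from the paper.
        mse = (weights[tr_idx] * (pred - costs[tr_idx]) ** 2).sum()
        # LASSO (L1) regularization on the network parameters.
        l1 = sum(p.abs().sum() for p in model.parameters())
        (mse + lam * l1).backward()
        opt.step()

    with torch.no_grad():
        val_mse = ((model(states[val_idx]) - costs[val_idx]) ** 2).mean()
    return model, val_mse.item()
```

Weighting each iteration's data by $\gamma^{k-k_0}$ lets older policy-evaluation data decay gradually instead of being discarded outright, matching the retention-factor objective quoted above; the held-out MSE serves only as an overfitting check, not as a formal test split.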