Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization
Authors: Quentin Cappart, Thierry Moisan, Louis-Martin Rousseau, Isabeau Prémont-Schwarz, Andre A. Cire
AAAI 2021, pp. 3677-3687 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that our solver is efficient to solve three challenging problems: the traveling salesman problem with time windows, the 4-moments portfolio optimization problem, and the 0-1 knapsack problem. Results obtained show that the framework introduced outperforms the stand-alone RL and CP solutions, while being competitive with industrial solvers. |
| Researcher Affiliation | Collaboration | 1 Ecole Polytechnique de Montréal, Montreal, Canada 2 Element AI, Montreal, Canada 3 University of Toronto Scarborough, Toronto, Canada |
| Pseudocode | Yes | Algorithm 1: BaB-DQN Search Procedure. Algorithm 2: ILDS-DQN Search Procedure. Algorithm 3: RBS-PPO Search Procedure. |
| Open Source Code | Yes | Our detailed contributions are as follows: ... (5) The open-source release of our code and models, in order to ease the future research in this field (footnote 1: https://github.com/qcappart/hybrid-cp-rl-solver). |
| Open Datasets | No | The training is done using randomly generated instances sampled from a similar distribution to those we want to solve. Such an assumption is common in the vast majority of works tackling NP-hard problems using ML (Khalil et al. 2017; Kool, Van Hoof, and Welling 2018; Cappart et al. 2019), and, despite being strong, has nonetheless a practical interest when repeatedly solving similar instances of the same problem (e.g., package shipping by retailers). The paper mentions generating instances but does not provide concrete access to a publicly available dataset or a specific citation for one. |
| Dataset Splits | Yes | A new model is recorded after each 100 episodes of the RL algorithm and the model achieving the best average reward on a validation set of 100 instances generated in the same way as for the training is selected. The final evaluation is done on 100 other instances (still randomly generated in the same manner) using Intel Xeon E5-2650 CPU with 32GB of RAM and a time limit of 60 minutes. |
| Hardware Specification | Yes | Memory consumption is limited to 32 GB and 1 GPU (Tesla V100-SXM2-32GB) is used per model; the final evaluation is done on 100 other instances (still randomly generated in the same manner) using an Intel Xeon E5-2650 CPU with 32GB of RAM and a time limit of 60 minutes. |
| Software Dependencies | No | Algorithms used for training have been implemented in Python and Pytorch (Paszke et al. 2019) is used for designing the neural networks. Library DGL (Wang et al. 2019) is used for implementing graph embedding, and Set Transformer (Lee et al. 2018) for set embedding. The CP solver used is Gecode (Schulte, Lagerkvist, and Tack 2006), which has the benefit to be opensource and to offer a lot of freedom for designing new search procedures. As Gecode is implemented in C++, an operability interface with Python code is required. It is done using Pybind11 (Jakob, Rhinelander, and Moldovan 2017). The paper lists software but not their specific version numbers. (An illustrative snippet of this stack follows the table.) |
| Experiment Setup | No | Training time is limited to 48 hours, memory consumption to 32 GB and 1 GPU (Tesla V100-SXM2-32GB) is used per model. Models are trained with a single run. A new model is recorded after each 100 episodes of the RL algorithm and the model achieving the best average reward on a validation set of 100 instances generated in the same way as for the training is selected. The final evaluation is done on 100 other instances (still randomly generated in the same manner) using Intel Xeon E5-2650 CPU with 32GB of RAM and a time limit of 60 minutes. Detailed information about the hyper-parameters tested and selected are proposed in the supplementary material; the specific hyper-parameters are thus deferred rather than reported in the paper itself. (An illustrative Python sketch of this checkpoint-and-validation protocol follows the table.) |
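
The Software Dependencies row lists the stack the authors report: Python, PyTorch for the neural networks, DGL for graph embedding, Set Transformer for set embedding, and Gecode (C++) bridged to Python via Pybind11. As a rough illustration of how the PyTorch/DGL portion of such a stack fits together, the snippet below runs a single graph-convolution layer on a toy graph. It is not the paper's architecture, only standard DGL/PyTorch usage on made-up data.

```python
import torch
import dgl
from dgl.nn import GraphConv

# Toy 3-node directed cycle; in the paper the graph would encode a problem
# instance (e.g., TSPTW cities and travel times), but this graph is illustrative only.
g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])))
g = dgl.add_self_loop(g)          # common safeguard against zero-in-degree nodes

feats = torch.randn(3, 4)         # 4-dimensional node features
conv = GraphConv(in_feats=4, out_feats=8)

h = conv(g, feats)                # one round of message passing
print(h.shape)                    # torch.Size([3, 8]) node embeddings
```

The actual embedding networks and the Gecode/Pybind11 bridge are in the released repository (https://github.com/qcappart/hybrid-cp-rl-solver).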
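
The Dataset Splits and Experiment Setup rows describe the same protocol: instances are randomly generated, a checkpoint is recorded every 100 training episodes, the checkpoint with the best average reward on a 100-instance validation set is kept, and the final evaluation uses 100 further instances. A minimal sketch of that loop is given below; every function and field name here (generate_instance, run_training_episode, average_reward, the toy "model") is a hypothetical placeholder, not the repository's API.

```python
import copy
import random

# Hypothetical placeholders for the paper's components; names are illustrative only.
def generate_instance(rng):
    return {"cities": rng.sample(range(1000), 20)}      # toy stand-in for a TSPTW instance

def run_training_episode(model, rng):
    model["weights"] += rng.uniform(-1.0, 1.0)           # toy stand-in for a DQN/PPO update

def average_reward(model, instances, rng):
    return -abs(model["weights"]) + rng.random()         # toy stand-in for rollout reward

rng = random.Random(0)
validation_set = [generate_instance(rng) for _ in range(100)]   # 100 validation instances
test_set = [generate_instance(rng) for _ in range(100)]         # 100 held-out test instances

model = {"weights": 0.0}
checkpoints = []
for episode in range(1, 2001):
    run_training_episode(model, rng)
    if episode % 100 == 0:                    # a checkpoint is recorded every 100 episodes
        checkpoints.append(copy.deepcopy(model))

# Keep the checkpoint with the best average validation reward;
# the comparison against CP/RL/industrial baselines is then run on test_set.
best = max(checkpoints, key=lambda m: average_reward(m, validation_set, rng))
```

In the paper's setup, the stopping condition is the stated training budget (48 hours, 32 GB, one V100 per model) rather than the fixed episode count used in this sketch.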