A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation
Authors: Haofeng Yuan, Lichang Fang, Shiji Song
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our approach is evaluated on two sets of problems: the cutting stock problem and the graph coloring problem. Compared to several widely used single-column and multiple-column selection strategies, our RL-based multiple-column selection strategy leads to faster convergence and achieves remarkable reductions in the number of CG iterations and runtime. |
| Researcher Affiliation | Academia | Haofeng Yuan, Lichang Fang, Shiji Song* Department of Automation, BNRist, Tsinghua University {yhf22, fanglc22}@mails.tsinghua.edu.cn, shijis@mail.tsinghua.edu.cn |
| Pseudocode | No | The information is insufficient. The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The information is insufficient. The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | The problem instances are generated according to the rules for random instances in BPPLIB (Delorme, Iori, and Martello 2016, 2018), a widely used benchmark for bin packing and cutting stock problems. Similar to the CSP, the GCP instances are divided into three categories, corresponding to the number of nodes N = 30, 40, and 50 respectively. The GCP instances are generated according to the rules for random graphs in (Mehrotra and Trick 1996). |
| Dataset Splits | No | The information is insufficient. The paper mentions training and evaluation instances but does not provide specific details on training, validation, and test splits (e.g., percentages, counts, or explicit validation sets) in a reproducible manner. |
| Hardware Specification | No | The information is insufficient. The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. It only vaguely mentions 'our RL-based strategies (on GPU)'. |
| Software Dependencies | Yes | In general, several sub-optimal solutions of PP, which form a candidate column pool, can be obtained through dynamic programming methods or commercial solvers such as Gurobi (Gurobi Optimization, LLC 2023). (A hypothetical sketch of building such a candidate pool with Gurobi follows the table.) |
| Experiment Setup | Yes | We select 5 out of 10 candidate columns at each iteration, which strikes a balance between the number of iterations and the cost per iteration in our task. To guarantee convergence, we force the optimal solution of the PP to always be selected. We set the number of layers N1 = N2 = N3 = N4 = 3 for the MLPs in the encoder and decoder. The weights in the reward function are set to α = 300 and β = 0.02 to balance the reward scales, and the discount factor γ is set to 0.9. We use PPO with a clipping threshold ϵ = 0.2, and the Adam optimizer with a learning rate of 1 × 10⁻³ to train the RL model. (An illustrative configuration sketch collecting these hyperparameters also follows the table.) |
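
The Software Dependencies row notes that the candidate column pool is formed from sub-optimal pricing-problem (PP) solutions obtained via a solver such as Gurobi. The paper does not release code, so the following is only a minimal sketch of one way this could be done for the cutting stock problem, using Gurobi's solution pool; the function name and the arguments `duals`, `widths`, `W`, and `pool_size` are our own illustrative assumptions, not the authors' implementation.

```python
import gurobipy as gp
from gurobipy import GRB

def candidate_columns(duals, widths, W, pool_size=10):
    """Collect up to `pool_size` pricing-problem solutions as candidate columns.

    Hypothetical sketch: solves the CSP knapsack pricing problem
    max sum(duals[i] * x[i])  s.t.  sum(widths[i] * x[i]) <= W,  x integer,
    and reads back Gurobi's solution pool (optimal plus sub-optimal solutions).
    """
    n = len(widths)
    m = gp.Model("pricing")
    m.Params.OutputFlag = 0
    m.Params.PoolSearchMode = 2      # search systematically for the n best solutions
    m.Params.PoolSolutions = pool_size
    x = m.addVars(n, vtype=GRB.INTEGER, name="x")
    m.addConstr(gp.quicksum(widths[i] * x[i] for i in range(n)) <= W)
    m.setObjective(gp.quicksum(duals[i] * x[i] for i in range(n)), GRB.MAXIMIZE)
    m.optimize()

    pool = []
    for k in range(m.SolCount):
        m.Params.SolutionNumber = k           # switch to the k-th pool solution
        pool.append([round(x[i].Xn) for i in range(n)])
    return pool  # pool[0] is the optimal pricing solution; the rest are sub-optimal
```

In a column generation loop, the selection strategy would then pick a subset of `pool` to add to the restricted master problem at each iteration.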
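
For reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. This is an illustrative sketch only; the field names are ours, and the values simply restate what the paper reports.

```python
from dataclasses import dataclass

@dataclass
class RLColumnSelectionConfig:
    # Column selection: pick 5 of the 10 candidate columns per CG iteration,
    # always forcing the optimal pricing-problem solution into the selection.
    pool_size: int = 10
    columns_per_iteration: int = 5
    force_optimal_column: bool = True
    # Encoder/decoder MLP depths (N1 = N2 = N3 = N4 = 3 in the paper).
    mlp_layers: tuple = (3, 3, 3, 3)
    # Reward weights and discount factor.
    reward_alpha: float = 300.0
    reward_beta: float = 0.02
    gamma: float = 0.9
    # PPO and optimizer settings.
    ppo_clip_epsilon: float = 0.2
    adam_learning_rate: float = 1e-3

config = RLColumnSelectionConfig()
```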