Weakly Coupled Deep Q-Networks

Authors: Ibrahim El Shar, Daniel Jiang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments show faster convergence compared to DQN and related techniques in settings with as many as 10 subproblems, 3^10 total actions, and a continuous state space.
Researcher Affiliation | Collaboration | Ibrahim El Shar (Hitachi America Ltd. and University of Pittsburgh, Sunnyvale, CA, ibrahim.elshar@hal.hitachi.com); Daniel Jiang (Meta and University of Pittsburgh, New York, NY, drjiang@meta.com)
Pseudocode | Yes | Algorithm 1: Weakly Coupled DQN
Open Source Code | Yes | Our code is available at https://github.com/ibrahim-elshar/WCDQN_NeurIPS.
Open Datasets | No | The paper describes environment parameters for simulated problems (e.g., EV charging, multi-product inventory control, online stochastic ad matching), including how data is generated (e.g., Poisson distributions, Markov chains), but does not reference or provide access information for a pre-existing, publicly available dataset.
Dataset Splits | No | The paper describes reinforcement learning experiments, which typically involve continuous interaction with an environment rather than predefined dataset splits. It does not provide specific training, validation, or test splits (e.g., an 80/10/10 split) or cross-validation details for its experiments.
Hardware Specification | Yes | Experiments were run on a shared-memory cluster with dual 12-core Skylake CPUs (Intel Xeon Gold 6126, 2.60 GHz) and 192 GB RAM per node. Specifically, this work used the H2P cluster, which is supported by NSF award number OAC-2117681.
Software Dependencies | No | The paper mentions the use of a neural network architecture and the Adam optimizer, but it does not specify any software dependencies with concrete version numbers (e.g., Python, PyTorch, or TensorFlow versions) that would be needed for replication.
Experiment Setup | Yes | A discount factor of 0.9 is used for the EV charging problem and 0.99 for the multi-product inventory and online stochastic ad matching problems. In the tabular setting, we use a polynomial learning rate that depends on state-action visitation, given by α_n(s, a) = 1/ν_n(s, a)^r, where ν_n(s, a) is the number of times (s, a) has been visited up to iteration n and r = 0.4. We also use an ϵ-greedy exploration policy, given by ϵ(s) = 1/ν(s)^e, where ν(s) is the number of times state s has been visited; we set e = 0.4. In the function approximation setting, we use an ϵ-greedy policy that decays ϵ from 1 to 0.05 after 30,000 steps. All state-action value functions are initialized randomly. Each method was trained for 6,000 episodes for the EV charging problem, 5,000 for the multi-product inventory control problem, and 10,000 for the online stochastic ad matching problem. The episode lengths for the EV charging, online stochastic ad matching, and multi-product inventory control problems are 50, 30, and 25, respectively. We performed 5 independent replications. We use a neural network architecture with two hidden layers of 64 and 32 units, respectively, for all algorithms; a rectified linear unit (ReLU) is used as the activation function for each hidden layer. The Adam optimizer [36] with a learning rate of 1.0 × 10^-4 was used. For OTDQN, we use the same parameter settings as in He et al. [25]. For WCDQN, we use a Lagrangian multiplier λ ∈ [0, 10] with a 0.01 discretization. We also used an experience buffer of size 100,000 and initialized it with 10,000 experience tuples obtained using a random policy. For the WCDQN algorithm, we set the penalty coefficient κ^U to 10 after a small amount of manual hyperparameter tuning over the set {1, 2, 4, 10}.
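To make the reported setup concrete, below is a minimal sketch, assuming PyTorch, of the pieces described in the Experiment Setup row: the 64/32 ReLU network, the tabular polynomial learning rate and visitation-based ϵ schedule, the ϵ decay used with function approximation, the Adam optimizer at 1e-4, and the Lagrange multiplier grid. The names (QNetwork, tabular_learning_rate, epsilon_by_step), the placeholder state/action dimensions, and the linear form of the ϵ decay are illustrative assumptions, not taken from the authors' repository.

```python
# Hedged sketch of the reported experiment configuration; names and the
# linear epsilon decay are assumptions, hyperparameters follow the text above.
import numpy as np
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Two hidden layers (64 and 32 units) with ReLU activations, as reported."""

    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, num_actions),
        )

    def forward(self, x):
        return self.net(x)


def tabular_learning_rate(visit_count, r=0.4):
    """Polynomial learning rate alpha_n(s, a) = 1 / nu_n(s, a)^r (tabular setting)."""
    return 1.0 / (visit_count ** r)


def tabular_epsilon(state_visits, e=0.4):
    """Visitation-based exploration epsilon(s) = 1 / nu(s)^e (tabular setting)."""
    return 1.0 / (state_visits ** e)


def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=30_000):
    """Decay epsilon from 1.0 to 0.05 over 30,000 steps (function approximation).
    The linear schedule is an assumption; the paper only states the endpoints."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


# Adam with learning rate 1.0e-4, as stated in the setup; dimensions are placeholders.
q_net = QNetwork(state_dim=8, num_actions=3)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

# Lagrange multiplier grid lambda in [0, 10] with a 0.01 discretization (WCDQN).
lambda_grid = np.arange(0.0, 10.0 + 0.01, 0.01)
```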