Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinforcement Learning for Joint Optimization of Multiple Rewards
Authors: Mridul Agarwal, Vaneet Aggarwal
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7. Evaluations In this section, we consider two systems. The first is the cellular scheduling, where multiple users connect to the base station. The second is a multiple-queue system which models multiple roads merging into a single lane. In both these systems, the proposed algorithms are compared with some baselines including the linear metric adaptation of reward at each time. |
| Researcher Affiliation | Academia | Mridul Agarwal, Purdue University, West Lafayette, IN 47907; Vaneet Aggarwal, Purdue University, West Lafayette, IN 47907 |
| Pseudocode | Yes | Algorithm 1: Model-Based Joint Decision Making Algorithm; Algorithm 2: Model Free Joint Policy Gradient |
| Open Source Code | Yes | The algorithm source codes for the proposed algorithms have been provided at (Agarwal and Aggarwal, 2019). Mridul Agarwal and Vaneet Aggarwal. Source Code for Non-Linear Reinforcement Learning. https://github.rcac.purdue.edu/Clan-labs/non-markov-RL, 2019. |
| Open Datasets | No | The paper describes creating custom simulation environments based on specific models (e.g., 'Gauss-Markov channel model') and parameters (e.g., 'Rate values are practically observable data rates over a wireless network such as 4G-LTE' for Table 1, 'Arrival rates λk (in number of packets per step) for proportional fairness' for Table 3). These are self-generated simulation data/models rather than references to external, publicly available datasets with concrete access information. |
| Dataset Splits | No | The paper describes training procedures, such as 'We trained the SARSA algorithm and the model based Algorithm 1 for 5000 time steps' or 'used 1000 batches where each batch contains 36 trajectories of length 1000 time steps', but it does not specify explicit training/test/validation dataset splits. The evaluation is performed within simulation environments without conventional dataset splitting. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU, CPU models, memory) used to conduct the experiments or simulations. |
| Software Dependencies | No | In Algorithm 1, the convex optimization is solved using CVXPY Diamond and Boyd (2016). While CVXPY is mentioned, a specific version number for the library itself is not provided. Other software components like optimizers (Adam optimizer) are mentioned without specific version numbers. |
| Experiment Setup | Yes | The DQN neural network consists of two fully connected hidden layers with 100 units each with ReLU activation and one output layer with linear activation. We use γ = 0.99, ϵ = 0.05, and the Adam optimizer with learning rate η = 0.01 to optimize the network. The batch size is 64 and the network is trained for 1500 episodes. Proposed Model Based Algorithm: We keep γ = 0.99 for the implementation of Algorithm 1. We use a fixed episode length of 100 and update the policy after every τ = 100 steps. Proposed Model Free Algorithm: The neural network consists of a single hidden layer with 200 neurons, each having ReLU activation. The output layer uses softmax activation. The values of the other hyperparameters are γ = 0.99, η = 1e-3, and batch size N = 100. |
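The model-free policy network described in the setup above (a single 200-unit ReLU hidden layer followed by a softmax output) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the state dimension (4) and number of actions (3) are assumed for the example, and the weights are random rather than trained.

```python
import math
import random

def relu(v):
    # Elementwise ReLU over a vector (list of floats).
    return [max(0.0, x) for x in v]

def softmax(v):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def linear(x, W, b):
    # Fully connected layer: each row of W holds the weights of one output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def policy_forward(state, W1, b1, W2, b2):
    # Single hidden layer with ReLU, softmax output, as in the setup above.
    h = relu(linear(state, W1, b1))
    return softmax(linear(h, W2, b2))

# Assumed dimensions for illustration only; the paper's excerpt specifies
# the 200-unit hidden layer but not the state/action sizes.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 200, 3
random.seed(0)
W1 = [[random.gauss(0, 0.1) for _ in range(STATE_DIM)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(N_ACTIONS)]
b2 = [0.0] * N_ACTIONS

probs = policy_forward([0.5, -0.2, 1.0, 0.0], W1, b1, W2, b2)
```

The output `probs` is a valid probability distribution over the assumed actions; a policy-gradient method such as the paper's Algorithm 2 would sample actions from it and update the weights with learning rate η = 1e-3.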