Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Reinforcement Learning for Joint Optimization of Multiple Rewards

Authors: Mridul Agarwal, Vaneet Aggarwal

JMLR 2023

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: "7. Evaluations — In this section, we consider two systems. The first is cellular scheduling, where multiple users connect to the base station. The second is a multiple-queue system which models multiple roads merging into a single lane. In both systems, the proposed algorithms are compared with baselines, including the linear metric adaptation of reward at each time."

Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: Mridul Agarwal, EMAIL, Purdue University, West Lafayette, IN 47907; Vaneet Aggarwal, EMAIL, Purdue University, West Lafayette, IN 47907

Reproducibility Variable: Pseudocode
Result: Yes
LLM Response: Algorithm 1: Model-Based Joint Decision Making Algorithm; Algorithm 2: Model-Free Joint Policy Gradient

Reproducibility Variable: Open Source Code
Result: Yes
LLM Response: "The algorithm source codes for the proposed algorithms have been provided at (Agarwal and Aggarwal, 2019)." Mridul Agarwal and Vaneet Aggarwal. Source Code for Non-Linear Reinforcement Learning. https://github.rcac.purdue.edu/Clan-labs/non-markov-RL, 2019.

Reproducibility Variable: Open Datasets
Result: No
LLM Response: The paper describes creating custom simulation environments based on specific models (e.g., a Gauss-Markov channel model) and parameters (e.g., "Rate values are practically observable data rates over a wireless network such as 4G-LTE" for Table 1; "Arrival rates λk (in number of packets per step) for proportional fairness" for Table 3). These are self-generated simulation data and models rather than references to external, publicly available datasets with concrete access information.

Reproducibility Variable: Dataset Splits
Result: No
LLM Response: The paper describes training procedures, such as "We trained the SARSA algorithm and the model based Algorithm 1 for 5000 time steps" and "used 1000 batches where each batch contains 36 trajectories of length 1000 time steps", but it does not specify explicit training/validation/test splits. Evaluation is performed within simulation environments without conventional dataset splitting.

Reproducibility Variable: Hardware Specification
Result: No
LLM Response: The paper does not provide any specific details about the hardware (e.g., GPU or CPU models, memory) used to run the experiments or simulations.

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: "In Algorithm 1, the convex optimization is solved using CVXPY (Diamond and Boyd, 2016)." While CVXPY is mentioned, no specific version number is given for the library itself. Other software components, such as the Adam optimizer, are likewise mentioned without version numbers.

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: DQN baseline — "The DQN neural network consists of two fully connected hidden layers with 100 units each with ReLU activation and one output layer with linear activation. We use γ = 0.99, ϵ = 0.05, and Adam optimizer with learning rate η = 0.01 to optimize the network. The batch size is 64 and the network is trained for 1500 episodes." Proposed model-based algorithm — "We keep γ = 0.99 for implementation of Algorithm 1. We use a fixed episode length of 100 and update policy after every τ = 100 steps." Proposed model-free algorithm — "The neural network consists of a single hidden layer with 200 neurons, each having ReLU activation function. The output layer uses softmax activation. The value of other hyperparameters are γ = 0.99, η = 1e-3, and batch size N = 100."
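The DQN architecture quoted in the experiment setup is specific enough to sketch. The following is a minimal NumPy illustration of the described forward pass only (two fully connected hidden layers of 100 ReLU units, linear output); the state dimension, action count, and weight initialization are hypothetical assumptions, since the paper excerpt does not state them.

```python
import numpy as np

def relu(x):
    # ReLU activation used in both hidden layers.
    return np.maximum(0.0, x)

def init_layer(n_in, n_out, rng):
    # Small random Gaussian weights; the excerpt does not state an
    # initialization scheme, so this choice is purely illustrative.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def dqn_forward(states, params):
    # Two fully connected hidden layers (100 units, ReLU) followed by
    # a linear output layer producing one Q-value per action.
    (W1, b1), (W2, b2), (W3, b3) = params
    h = relu(states @ W1 + b1)
    h = relu(h @ W2 + b2)
    return h @ W3 + b3  # linear activation: raw Q-values

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 8, 4  # hypothetical dimensions, not from the paper
params = [
    init_layer(STATE_DIM, 100, rng),
    init_layer(100, 100, rng),
    init_layer(100, N_ACTIONS, rng),
]
# A batch of 64 states, matching the quoted batch size of 64.
q_values = dqn_forward(rng.normal(size=(64, STATE_DIM)), params)
```

Training details from the quote (Adam with η = 0.01, γ = 0.99, ϵ-greedy exploration with ϵ = 0.05, 1500 episodes) are omitted here; this sketch only fixes the network shape the report extracted.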