Contextual Bilevel Reinforcement Learning for Incentive Alignment

Authors: Vinzenz Thoma, Barna Pásztor, Andreas Krause, Giorgia Ramponi, Yifan Hu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the performance of our algorithm for reward shaping and tax design. We illustrate the performance of HPGD in the Four-Rooms environment and on the Tax Design for Macroeconomic Model problem [34, 15] that we extend to multiple households with diverse preferences.
Researcher Affiliation | Academia | Vinzenz Thoma, ETH AI Center, vinzenz.thoma@ai.ethz.ch; Barna Pásztor, ETH AI Center, barna.pasztor@ai.ethz.ch; Andreas Krause, ETH Zurich, krausea@ethz.ch; Giorgia Ramponi, University of Zurich, giorgia.ramponi@uzh.ch; Yifan Hu, EPFL & ETH Zurich, yifan.hu@epfl.ch
Pseudocode | Yes | Algorithm 1 Hyper Policy Gradient Descent (HPGD), Algorithm 2 Gradient Estimator, Algorithm 3 Soft Value Iteration, Algorithm 4 Soft Q-learning, Algorithm 5 Decomposable Gradient Estimator, Algorithm 6 Vanilla Policy Gradient Algorithm, Algorithm 7 HPGD with vanilla soft Q-learning, Algorithm 8 HPGD with RT-Q. (A generic soft value iteration sketch is given below the table.)
Open Source Code | Yes | We implemented our experiments end-to-end in JAX [10] for its runtime benefits and ease of experimentation. The code is available at https://github.com/lasgroup/HPGD.
Open Datasets | No | The paper describes using a 'Four-Rooms environment' and a 'Tax Design for Macroeconomic Models problem [34, 15]', which are custom-designed or simulated environments/models rather than publicly available datasets with explicit access information (e.g., URL, DOI, repository).
Dataset Splits | No | The paper describes training within simulated environments and refers to 'outer iterations' and 'environment steps' but does not specify explicit train/validation/test dataset splits for static data.
Hardware Specification | Yes | We ran our experiments on a shared cluster equipped with various NVIDIA GPUs and AMD EPYC CPUs. Our default configuration for all experiments was a single GPU with 24 GB of memory, 16 CPU cores, and 4 GB of RAM per CPU core.
Software Dependencies | No | The paper states 'We implemented our experiments end-to-end in JAX [10]', but it does not specify a version number for JAX or for any other software dependency, which is required for reproducibility.
Experiment Setup | Yes | For the upper-level optimization problem, we use gradient norm clipping of 1.0. The learning rate for each algorithm has been chosen as the best performing one from [1.0, 0.5, 0.1, 0.05, 0.01] individually. Additionally, we tune the parameter C for the Zero-order algorithm on the values [0.1, 0.5, 1.0, 2.0, 5.0]. For Hyper Policy Gradient Descent, we sample 10,000 environment steps for each gradient calculation.
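
To make the reported upper-level configuration concrete, the following is a minimal JAX/optax sketch, not the authors' implementation: it assumes optax is used for the upper-level optimizer, and the parameter pytree and the placeholder hypergradient are hypothetical stand-ins for the quantities HPGD actually computes from sampled environment steps.

```python
# Illustrative sketch only, not the authors' code: an upper-level update
# mirroring the reported setup (gradient-norm clipping at 1.0, learning rate
# taken from the swept grid). In HPGD the gradient estimate would be built
# from the ~10,000 sampled environment steps per update.
import jax.numpy as jnp
import optax

LEARNING_RATE_GRID = [1.0, 0.5, 0.1, 0.05, 0.01]  # grid reported in the paper

def make_upper_optimizer(learning_rate: float) -> optax.GradientTransformation:
    """Upper-level optimizer: clip gradients to global norm 1.0, then SGD."""
    return optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.sgd(learning_rate),
    )

# Hypothetical usage with placeholder upper-level parameters.
params = {"incentive_params": jnp.zeros(4)}
optimizer = make_upper_optimizer(learning_rate=LEARNING_RATE_GRID[2])
opt_state = optimizer.init(params)

hypergrad = {"incentive_params": jnp.ones(4)}  # stand-in for the HPGD estimate
updates, opt_state = optimizer.update(hypergrad, opt_state, params)
params = optax.apply_updates(params, updates)
```

Any other learning rate from the reported grid would simply be passed to make_upper_optimizer in the same way.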
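
The Pseudocode row above lists a Soft Value Iteration subroutine (Algorithm 3). As a generic point of reference only, here is a minimal sketch of tabular soft (entropy-regularized) value iteration in JAX; it is not taken from the paper or its repository, and the transition-tensor shape, temperature, discount, and iteration count are assumptions.

```python
# Generic tabular soft value iteration sketch in JAX (not Algorithm 3 verbatim).
import jax
import jax.numpy as jnp

def soft_value_iteration(P, r, gamma=0.99, temperature=1.0, n_iters=1000):
    """P: (S, A, S) transition tensor, r: (S, A) reward table."""
    n_states, _ = r.shape
    V = jnp.zeros(n_states)

    def body(_, V):
        # Soft Bellman backup: Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = r + gamma * jnp.einsum("sap,p->sa", P, V)
        # Soft maximum over actions (temperature-scaled log-sum-exp)
        return temperature * jax.scipy.special.logsumexp(Q / temperature, axis=1)

    V = jax.lax.fori_loop(0, n_iters, body, V)
    Q = r + gamma * jnp.einsum("sap,p->sa", P, V)
    # Soft-optimal policy: softmax over Q-values at the given temperature
    policy = jax.nn.softmax(Q / temperature, axis=1)
    return V, Q, policy
```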