Contextual Bilevel Reinforcement Learning for Incentive Alignment
Authors: Vinzenz Thoma, Barna Pásztor, Andreas Krause, Giorgia Ramponi, Yifan Hu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the performance of our algorithm for reward shaping and tax design. We illustrate the performance of HPGD in the Four-Rooms environment and on the Tax Design for Macroeconomic Model problem [34, 15] that we extend to multiple households with diverse preferences. |
| Researcher Affiliation | Academia | Vinzenz Thoma ETH AI Center vinzenz.thoma@ai.ethz.ch Barna Pasztor ETH AI Center barna.pasztor@ai.ethz.ch Andreas Krause ETH Zurich krausea@ethz.ch Giorgia Ramponi University of Zurich giorgia.ramponi@uzh.ch Yifan Hu EPFL & ETH Zurich yifan.hu@epfl.ch |
| Pseudocode | Yes | Algorithm 1 Hyper Policy Gradient Descent (HPGD), Algorithm 2 Gradient Estimator, Algorithm 3 Soft Value Iteration, Algorithm 4 Soft Q-learning, Algorithm 5 Decomposable Gradient Estimator, Algorithm 6 Vanilla Policy Gradient Algorithm, Algorithm 7 HPGD with vanilla soft Q-learning, Algorithm 8 HPGD with RT-Q. |
| Open Source Code | Yes | We implemented our experiments end-to-end in JAX [10] for its runtime benefits and ease of experimentation. The code is available at https://github.com/lasgroup/HPGD. |
| Open Datasets | No | The paper describes using a 'Four-Rooms environment' and a 'Tax Design for Macroeconomic Models problem [34, 15]', which are custom-designed or simulated environments rather than publicly available datasets with explicit access information (e.g., URL, DOI, repository). |
| Dataset Splits | No | The paper describes training within simulated environments and refers to 'outer iterations' and 'environment steps' but does not specify explicit train/validation/test dataset splits for static data. |
| Hardware Specification | Yes | We ran our experiments on a shared cluster equipped with various NVIDIA GPUs and AMD EPYC CPUs. Our default configuration for all experiments was a single GPU with 24 GB of memory, 16 CPU cores, and 4 GB of RAM per CPU core. |
| Software Dependencies | No | The paper states 'We implemented our experiments end-to-end in JAX [10]', but it does not specify a version number for JAX or any other software dependency, which would be required for reproducibility. |
| Experiment Setup | Yes | For the upper-level optimization problem, we use gradient norm clipping of 1.0. The learning rate for each algorithm has been chosen as the best performing one from [1.0, 0.5, 0.1, 0.05, 0.01] individually. Additionally, we tune the parameter C for the Zero-order algorithm on the values [0.1, 0.5, 1.0, 2.0, 5.0]. For Hyper Policy Gradient Descent, we sample 10,000 environment steps for each gradient calculation. |
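
Based only on the algorithm names reported in the Pseudocode row above (HPGD, Gradient Estimator, Soft Value Iteration) and the clipping and learning-rate values from the Experiment Setup row, the following is a minimal sketch of what a bilevel outer loop of this kind can look like in JAX. The toy reward-shaping MDP, the designer objective, and the use of `jax.grad` through a differentiable soft value iteration (as a stand-in for the paper's trajectory-based gradient estimator, Algorithm 2) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: bilevel outer loop with a soft-value-iteration lower level,
# gradient-norm clipping of 1.0, and one learning rate from the reported sweep.
import jax
import jax.numpy as jnp

S, A = 10, 4            # small tabular MDP (assumed sizes)
GAMMA, TEMP = 0.95, 1.0  # discount and entropy temperature (assumed values)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
P = jax.random.dirichlet(k1, jnp.ones(S), shape=(S, A))  # transitions P[s, a, s']
r_agent = jax.random.uniform(k2, (S, A))                 # agent's base reward
r_designer = jax.random.uniform(k3, (S, A))              # upper-level (designer) reward


def soft_value_iteration(reward, n_iters=200):
    """Lower level: entropy-regularized value iteration, returns the soft-optimal policy."""
    def step(V, _):
        Q = reward + GAMMA * P @ V
        return TEMP * jax.nn.logsumexp(Q / TEMP, axis=-1), None
    V, _ = jax.lax.scan(step, jnp.zeros(S), None, length=n_iters)
    Q = reward + GAMMA * P @ V
    return jax.nn.softmax(Q / TEMP, axis=-1)              # policy pi[s, a]


def upper_objective(theta):
    """Designer's value when the agent best-responds to the shaped reward r_agent + theta."""
    pi = soft_value_iteration(r_agent + theta)
    P_pi = jnp.einsum("sap,sa->sp", P, pi)                # state-to-state kernel under pi
    # discounted state occupancy from a uniform initial distribution
    d = jnp.linalg.solve(jnp.eye(S) - GAMMA * P_pi.T, jnp.ones(S) / S)
    return jnp.sum(d[:, None] * pi * r_designer)


def clip_by_norm(g, max_norm=1.0):
    """Gradient norm clipping of 1.0, as reported in the Experiment Setup row."""
    norm = jnp.sqrt(sum(jnp.sum(x ** 2) for x in jax.tree_util.tree_leaves(g)))
    scale = jnp.minimum(1.0, max_norm / (norm + 1e-12))
    return jax.tree_util.tree_map(lambda x: x * scale, g)


theta = jnp.zeros((S, A))                                  # upper-level shaping parameters
lr = 0.1                                                   # one value from the reported sweep
grad_fn = jax.jit(jax.grad(lambda th: -upper_objective(th)))
for t in range(100):                                       # outer iterations
    g = clip_by_norm(grad_fn(theta))
    theta = theta - lr * g
```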
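The Experiment Setup row also fixes the upper-level optimizer configuration: gradient norm clipping of 1.0, a learning-rate sweep over [1.0, 0.5, 0.1, 0.05, 0.01], a sweep over C for the zero-order baseline, and 10,000 environment steps per HPGD gradient estimate. A hedged sketch of that configuration is shown below; the use of optax and all function names are assumptions, as the paper only states that the implementation is in JAX.

```python
# Hedged sketch of the reported upper-level optimizer and sweep configuration.
import optax

LEARNING_RATES = [1.0, 0.5, 0.1, 0.05, 0.01]   # learning-rate sweep from the table
ZERO_ORDER_C = [0.1, 0.5, 1.0, 2.0, 5.0]       # C tuned only for the zero-order baseline
STEPS_PER_GRADIENT = 10_000                    # environment steps per HPGD gradient estimate


def make_upper_optimizer(lr: float) -> optax.GradientTransformation:
    """Upper-level optimizer: clip the global gradient norm to 1.0, then apply SGD."""
    return optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.sgd(lr),
    )


# Example sweep: one optimizer per candidate learning rate.
optimizers = {lr: make_upper_optimizer(lr) for lr in LEARNING_RATES}
```

Chaining the clipping transform before the SGD update mirrors the reported setup, where the gradient norm is clipped before the learning-rate step is applied.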