Contextual Bilevel Reinforcement Learning for Incentive Alignment

Authors: Vinzenz Thoma, Barna Pásztor, Andreas Krause, Giorgia Ramponi, Yifan Hu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the performance of our algorithm for reward shaping and tax design. We illustrate the performance of HPGD in the Four-Rooms environment and on the Tax Design for Macroeconomic Model problem [34, 15] that we extend to multiple households with diverse preferences.
Researcher Affiliation | Academia | Vinzenz Thoma, ETH AI Center, vinzenz.thoma@ai.ethz.ch; Barna Pásztor, ETH AI Center, barna.pasztor@ai.ethz.ch; Andreas Krause, ETH Zurich, krausea@ethz.ch; Giorgia Ramponi, University of Zurich, giorgia.ramponi@uzh.ch; Yifan Hu, EPFL & ETH Zurich, yifan.hu@epfl.ch
Pseudocode | Yes | Algorithm 1 Hyper Policy Gradient Descent (HPGD), Algorithm 2 Gradient Estimator, Algorithm 3 Soft Value Iteration, Algorithm 4 Soft Q-learning, Algorithm 5 Decomposable Gradient Estimator, Algorithm 6 Vanilla Policy Gradient Algorithm, Algorithm 7 HPGD with vanilla soft Q-learning, Algorithm 8 HPGD with RT-Q. (A generic soft value iteration sketch is given below the table.)
Open Source Code | Yes | We implemented our experiments end-to-end in JAX [10] for its runtime benefits and ease of experimentation. The code is available at https://github.com/lasgroup/HPGD.
Open Datasets | No | The paper describes using a 'Four-Rooms environment' and a 'Tax Design for Macroeconomic Models problem [34, 15]', which are custom-designed or simulated environments/models rather than publicly available datasets with explicit access information (e.g., URL, DOI, repository).
Dataset Splits | No | The paper describes training within simulated environments and refers to 'outer iterations' and 'environment steps' but does not specify explicit train/validation/test dataset splits for static data.
Hardware Specification | Yes | We ran our experiments on a shared cluster equipped with various NVIDIA GPUs and AMD EPYC CPUs. Our default configuration for all experiments was a single GPU with 24 GB of memory, 16 CPU cores, and 4 GB of RAM per CPU core.
Software Dependencies | No | The paper states 'We implemented our experiments end-to-end in JAX [10]', but it does not specify a version number for JAX or for any other software dependency, which is required for reproducibility.
Experiment Setup | Yes | For the upper-level optimization problem, we use gradient norm clipping of 1.0. The learning rate for each algorithm has been chosen as the best performing one from [1.0, 0.5, 0.1, 0.05, 0.01] individually. Additionally, we tune the parameter C for the Zero-order algorithm on the values [0.1, 0.5, 1.0, 2.0, 5.0]. For Hyper Policy Gradient Descent, we sample 10,000 environment steps for each gradient calculation.
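
To make the reported upper-level configuration concrete, the following is a minimal JAX/optax sketch, not the authors' implementation: it assumes optax is used for the upper-level optimizer, and the parameter pytree and the placeholder hypergradient are hypothetical stand-ins for the quantities HPGD actually computes from sampled environment steps.

```python
# Illustrative sketch only, not the authors' code: an upper-level update
# mirroring the reported setup (gradient-norm clipping at 1.0, learning rate
# taken from the swept grid). In HPGD the gradient estimate would be built
# from the ~10,000 sampled environment steps per update.
import jax.numpy as jnp
import optax

LEARNING_RATE_GRID = [1.0, 0.5, 0.1, 0.05, 0.01]  # grid reported in the paper

def make_upper_optimizer(learning_rate: float) -> optax.GradientTransformation:
    """Upper-level optimizer: clip gradients to global norm 1.0, then SGD."""
    return optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.sgd(learning_rate),
    )

# Hypothetical usage with placeholder upper-level parameters.
params = {"incentive_params": jnp.zeros(4)}
optimizer = make_upper_optimizer(learning_rate=LEARNING_RATE_GRID[2])
opt_state = optimizer.init(params)

hypergrad = {"incentive_params": jnp.ones(4)}  # stand-in for the HPGD estimate
updates, opt_state = optimizer.update(hypergrad, opt_state, params)
params = optax.apply_updates(params, updates)
```

Any other learning rate from the reported grid would simply be passed to make_upper_optimizer in the same way.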
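
The Pseudocode row above lists a Soft Value Iteration subroutine (Algorithm 3). As a generic point of reference only, here is a minimal sketch of tabular soft (entropy-regularized) value iteration in JAX; it is not taken from the paper or its repository, and the transition-tensor shape, temperature, discount, and iteration count are assumptions.

```python
# Generic tabular soft value iteration sketch in JAX (not Algorithm 3 verbatim).
import jax
import jax.numpy as jnp

def soft_value_iteration(P, r, gamma=0.99, temperature=1.0, n_iters=1000):
    """P: (S, A, S) transition tensor, r: (S, A) reward table."""
    n_states, _ = r.shape
    V = jnp.zeros(n_states)

    def body(_, V):
        # Soft Bellman backup: Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = r + gamma * jnp.einsum("sap,p->sa", P, V)
        # Soft maximum over actions (temperature-scaled log-sum-exp)
        return temperature * jax.scipy.special.logsumexp(Q / temperature, axis=1)

    V = jax.lax.fori_loop(0, n_iters, body, V)
    Q = r + gamma * jnp.einsum("sap,p->sa", P, V)
    # Soft-optimal policy: softmax over Q-values at the given temperature
    policy = jax.nn.softmax(Q / temperature, axis=1)
    return V, Q, policy
```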