Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Contextual Bilevel Reinforcement Learning for Incentive Alignment
Authors: Vinzenz Thoma, Barna Pásztor, Andreas Krause, Giorgia Ramponi, Yifan Hu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the performance of our algorithm for reward shaping and tax design. We illustrate the performance of HPGD in the Four-Rooms environment and on the Tax Design for Macroeconomic Model problem [34, 15] that we extend to multiple households with diverse preferences. |
| Researcher Affiliation | Academia | Vinzenz Thoma ETH AI Center EMAIL Barna Pasztor ETH AI Center EMAIL Andreas Krause ETH Zurich EMAIL Giorgia Ramponi University of Zurich EMAIL Yifan Hu EPFL & ETH Zurich EMAIL |
| Pseudocode | Yes | Algorithm 1 Hyper Policy Gradient Descent (HPGD), Algorithm 2 Gradient Estimator, Algorithm 3 Soft Value Iteration, Algorithm 4 Soft Q-learning, Algorithm 5 Decomposable Gradient Estimator, Algorithm 6 Vanilla Policy Gradient Algorithm, Algorithm 7 HPGD with vanilla soft Q-learning, Algorithm 8 HPGD with RT-Q. |
| Open Source Code | Yes | We implemented our experiments end-to-end in JAX [10] for its runtime benefits and ease of experimentation. The code is available at https://github.com/lasgroup/HPGD. |
| Open Datasets | No | The paper describes using a 'Four-Rooms environment' and a 'Tax Design for Macroeconomic Models problem [34, 15]' which are custom-designed or simulated environments/models rather than publicly available datasets with explicit access information (e.g., URL, DOI, repository). |
| Dataset Splits | No | The paper describes training within simulated environments and refers to 'outer iterations' and 'environment steps' but does not specify explicit train/validation/test dataset splits for static data. |
| Hardware Specification | Yes | We ran our experiments on a shared cluster equipped with various NVIDIA GPUs and AMD EPYC CPUs. Our default configuration for all experiments was a single GPU with 24 GB of memory, 16 CPU cores, and 4 GB of RAM per CPU core. |
| Software Dependencies | No | The paper states 'We implemented our experiments end-to-end in JAX [10]', but it does not specify a version number for JAX or any other software dependencies with their specific versions, which is required for reproducibility. |
| Experiment Setup | Yes | For the upper-level optimization problem, we use gradient norm clipping of 1.0. The learning rate for each algorithm has been chosen as the best performing one from [1.0, 0.5, 0.1, 0.05, 0.01] individually. Additionally, we tune the parameter C for the Zero-order algorithm on the values [0.1, 0.5, 1.0, 2.0, 5.0]. For Hyper Policy Gradient Descent, we sample 10,000 environment steps for each gradient calculation. |
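
The Experiment Setup row can be read as a small configuration plus two operations: clipping the upper-level gradient to norm 1.0 and picking a learning rate from a fixed grid. The sketch below illustrates that reading in plain Python. It is not the authors' JAX implementation; the names `clip_by_norm`, `pick_best_learning_rate`, and the `evaluate` callback are hypothetical, and "best performing" is assumed to mean the highest score returned by that callback, since the quoted text does not state the selection criterion.

```python
import math

# Values quoted in the Experiment Setup row; the constant names are illustrative.
MAX_GRAD_NORM = 1.0
LEARNING_RATES = [1.0, 0.5, 0.1, 0.05, 0.01]
ZERO_ORDER_C = [0.1, 0.5, 1.0, 2.0, 5.0]
ENV_STEPS_PER_GRADIENT = 10_000


def clip_by_norm(grad, max_norm=MAX_GRAD_NORM):
    """Rescale a gradient (list of floats) so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already within the limit, return an unchanged copy
    scale = max_norm / norm
    return [g * scale for g in grad]


def pick_best_learning_rate(evaluate, candidates=LEARNING_RATES):
    """Return the candidate with the highest score.

    `evaluate` is a hypothetical callback mapping a learning rate to a
    scalar score (assumed: higher is better).
    """
    return max(candidates, key=evaluate)
```

In a real run, `evaluate` would train one algorithm with the given learning rate and return its final performance, repeated separately for each algorithm as the row describes.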