Reward Design for Justifiable Sequential Decision-Making
Authors: Aleksa Sukovic, Goran Radanovic
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive empirical evaluation of our approach on a real-world problem of treating sepsis, testing the performance and justifiability of policies trained through our framework (Sec. 5.2), as well as the effectiveness and robustness of argumentative agents (Sec. 5.3, Sec. 5.4, and Sec. 5.5). |
| Researcher Affiliation | Academia | Max Planck Institute for Software Systems; Saarland University; {asukovic, gradanovic}@mpi-sws.org |
| Pseudocode | No | The paper does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available at github.com/aleksa-sukovic/iclr2024-reward-design-for-justifiable-rl. |
| Open Datasets | Yes | Data for our cohort were obtained following steps outlined in Komorowski et al. (2018), utilizing MIMIC-III v1.4 database (Johnson et al., 2016). |
| Dataset Splits | Yes | The dataset is split into chunks of 70%, 15%, 15% used for training, validation, and testing respectively. (A grouped-split sketch follows the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types) for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as PPO, the Adam optimizer, and Deep Q-networks, but does not provide specific version numbers for these or for other relevant libraries/frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | The learning is done for a total of 100 epochs using batches of 64 comparisons sampled from the preference dataset D, Adam optimizer and a learning rate of 5e-4. (Sec 4.3 Judge Model). [...] To train the agent, we use PPO (Schulman et al., 2017) and examine two optimization strategies, namely self-play and maxmin. [...] The learning is done in batches of 256 (s, a, r, s′) tuples sampled from a Prioritized Experience Replay buffer (Schaul et al., 2015) using a learning rate of 1e-4, for a total of 25k iterations (Sec 4.3 Justifiable Agent). [...] The full list of used hyperparameters is given in Table 3. (A minimal configuration sketch follows the table.) |
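
The 70/15/15 split reported in the Dataset Splits row is straightforward to reproduce. The sketch below shows one plausible way to do it, grouping by ICU stay so that every transition of a patient trajectory lands in the same chunk. The pandas/NumPy usage and the `icustay_id` column name are assumptions for illustration, not taken from the authors' released code.

```python
# Minimal sketch of a 70/15/15 train/validation/test split over the sepsis cohort,
# grouped by ICU stay so trajectories are never split across chunks.
# Column name "icustay_id" is a hypothetical placeholder.
import numpy as np
import pandas as pd

def split_cohort(df: pd.DataFrame, seed: int = 0):
    rng = np.random.default_rng(seed)
    stays = df["icustay_id"].unique()
    rng.shuffle(stays)
    n = len(stays)
    train_ids = set(stays[: int(0.70 * n)])
    val_ids = set(stays[int(0.70 * n) : int(0.85 * n)])
    train = df[df["icustay_id"].isin(train_ids)]
    val = df[df["icustay_id"].isin(val_ids)]
    test = df[~df["icustay_id"].isin(train_ids | val_ids)]
    return train, val, test
```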
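
The judge-model settings quoted in the Experiment Setup row (100 epochs, batches of 64 preference comparisons, Adam with a learning rate of 5e-4) can be pictured with the following minimal PyTorch sketch. The network architecture, the feature and action dimensions, and the Bradley-Terry-style pairwise loss are illustrative assumptions; only the batch size, epoch count, optimizer, and learning rate come from the paper.

```python
# Hypothetical sketch of the judge-model training loop described in Sec. 4.3:
# 100 epochs, batches of 64 preference comparisons, Adam with lr 5e-4.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, action_dim = 48, 25           # illustrative sepsis features / discrete actions
judge = nn.Sequential(                   # scores a concatenated (state, action) pair
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(judge.parameters(), lr=5e-4)

# Toy preference dataset: each item holds two (state, action) pairs and a label
# indicating which of the two the judge should prefer.
n = 1024
xa = torch.randn(n, state_dim + action_dim)
xb = torch.randn(n, state_dim + action_dim)
prefers_a = torch.randint(0, 2, (n,)).float()
loader = DataLoader(TensorDataset(xa, xb, prefers_a), batch_size=64, shuffle=True)

for epoch in range(100):
    for a, b, label in loader:
        # Bradley-Terry-style preference loss on the score difference.
        logits = judge(a).squeeze(-1) - judge(b).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The agent-training settings (PPO with self-play or maxmin, batches of 256 transitions from a Prioritized Experience Replay buffer, learning rate 1e-4, 25k iterations) are not sketched here; the paper refers to Table 3 for the full hyperparameter list.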