Reward Design for Justifiable Sequential Decision-Making
Authors: Aleksa Sukovic, Goran Radanovic
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive empirical evaluation of our approach on a real-world problem of treating sepsis, testing the performance and justifiability of policies trained through our framework (Sec. 5.2), as well as the effectiveness and robustness of argumentative agents (Sec. 5.3, Sec. 5.4, and Sec. 5.5). |
| Researcher Affiliation | Academia | Max Planck Institute for Software Systems; Saarland University; {asukovic, gradanovic}@mpi-sws.org |
| Pseudocode | No | The paper does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available at github.com/aleksa-sukovic/iclr2024-reward-design-for-justifiable-rl. |
| Open Datasets | Yes | Data for our cohort were obtained following steps outlined in Komorowski et al. (2018), utilizing MIMIC-III v1.4 database (Johnson et al., 2016). |
| Dataset Splits | Yes | The dataset is split into chunks of 70%, 15%, 15% used for training, validation, and testing respectively. (A grouped-split sketch follows the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types) for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as PPO, the Adam optimizer, and Deep Q-networks, but does not provide specific version numbers for these or for other relevant libraries/frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | The learning is done for a total of 100 epochs using batches of 64 comparisons sampled from the preference dataset D, Adam optimizer and a learning rate of 5e-4. (Sec 4.3 Judge Model). [...] To train the agent, we use PPO (Schulman et al., 2017) and examine two optimization strategies, namely self-play and maxmin. [...] The learning is done in batches of 256 (s, a, r, s′) tuples sampled from a Prioritized Experience Replay buffer (Schaul et al., 2015) using a learning rate of 1e-4, for a total of 25k iterations (Sec 4.3 Justifiable Agent). [...] The full list of used hyperparameters is given in Table 3. (A minimal configuration sketch follows the table.) |
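
The 70/15/15 split reported in the Dataset Splits row is straightforward to reproduce. The sketch below shows one plausible way to do it, grouping by ICU stay so that every transition of a patient trajectory lands in the same chunk. The pandas/NumPy usage and the `icustay_id` column name are assumptions for illustration, not taken from the authors' released code.

```python
# Minimal sketch of a 70/15/15 train/validation/test split over the sepsis cohort,
# grouped by ICU stay so trajectories are never split across chunks.
# Column name "icustay_id" is a hypothetical placeholder.
import numpy as np
import pandas as pd

def split_cohort(df: pd.DataFrame, seed: int = 0):
    rng = np.random.default_rng(seed)
    stays = df["icustay_id"].unique()
    rng.shuffle(stays)
    n = len(stays)
    train_ids = set(stays[: int(0.70 * n)])
    val_ids = set(stays[int(0.70 * n) : int(0.85 * n)])
    train = df[df["icustay_id"].isin(train_ids)]
    val = df[df["icustay_id"].isin(val_ids)]
    test = df[~df["icustay_id"].isin(train_ids | val_ids)]
    return train, val, test
```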
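
The judge-model settings quoted in the Experiment Setup row (100 epochs, batches of 64 preference comparisons, Adam with a learning rate of 5e-4) can be pictured with the following minimal PyTorch sketch. The network architecture, the feature and action dimensions, and the Bradley-Terry-style pairwise loss are illustrative assumptions; only the batch size, epoch count, optimizer, and learning rate come from the paper.

```python
# Hypothetical sketch of the judge-model training loop described in Sec. 4.3:
# 100 epochs, batches of 64 preference comparisons, Adam with lr 5e-4.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, action_dim = 48, 25           # illustrative sepsis features / discrete actions
judge = nn.Sequential(                   # scores a concatenated (state, action) pair
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(judge.parameters(), lr=5e-4)

# Toy preference dataset: each item holds two (state, action) pairs and a label
# indicating which of the two the judge should prefer.
n = 1024
xa = torch.randn(n, state_dim + action_dim)
xb = torch.randn(n, state_dim + action_dim)
prefers_a = torch.randint(0, 2, (n,)).float()
loader = DataLoader(TensorDataset(xa, xb, prefers_a), batch_size=64, shuffle=True)

for epoch in range(100):
    for a, b, label in loader:
        # Bradley-Terry-style preference loss on the score difference.
        logits = judge(a).squeeze(-1) - judge(b).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The agent-training settings (PPO with self-play or maxmin, batches of 256 transitions from a Prioritized Experience Replay buffer, learning rate 1e-4, 25k iterations) are not sketched here; the paper refers to Table 3 for the full hyperparameter list.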