Explaining Reinforcement Learning Agents through Counterfactual Action Outcomes

Authors: Yotam Amitai, Yael Septon, Ofra Amir

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated the usefulness of COViz in supporting people's understanding of agents' preferences and compared it with reward decomposition... We conducted two user studies in which participants were asked to characterize the reward function of the agent by ranking its preferences.
Researcher Affiliation | Academia | Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology; yotama@campus.technion.ac.il, yaelfr1994@gmail.com, oamir@technion.ac.il
Pseudocode | Yes | The COViz algorithm: the pseudo-code is given in Algorithm 1, and its parameters are summarized in Table 1 along with their user-study values (see the sketch after the table).
Open Source Code | Yes | Code repository: https://github.com/yotamitai/COViz
Open Datasets | No | The paper describes generating its own data for the user studies: 'To extract the counterfactual outcomes, we ran 200 simulations of each trained agent and saved their traces. States extracted from these traces were used in the study.' While it mentions using the open-source Highway environment, it does not provide access information (a link, DOI, or specific citation for public availability) for the data (the simulation traces) used in the user studies.
Dataset Splits | No | The paper does not provide explicit training, validation, or test dataset splits for the data used in the user studies. It mentions agent training ('2,000 simulations'), but no data splits for the human-participant evaluation.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU models, memory amounts) used for running experiments are explicitly mentioned in the paper.
Software Dependencies | No | The paper mentions using a 'double deep Q-Network (DDQN)' architecture and basing its implementation on 'two open-source repositories: the Highway environment and its compatible agent architectures (Leurent 2018a,b)'. However, it does not provide specific version numbers for these or other software dependencies, such as deep learning frameworks or libraries.
Experiment Setup | Yes | Agents were trained using a double deep Q-Network (DDQN) (Mnih et al. 2015) architecture for 2,000 simulations, each with a maximum of 80 steps. The network input is the observation state, represented by an array of size 25 (5x5). The input layer is followed by two fully connected hidden layers of width 256. The final layer connects to three different heads, each designed to account for one of the specified desired behaviors. Each head consists of a linear layer and outputs a Q-value vector of length 5, predicting the Q-value for each of the possible actions: moving left (upwards), idle, moving right (downwards), going faster, going slower. The COViz algorithm parameters from Table 1: k (trajectory length) = 7, Nsim (number of simulations) = 200, CFMeth (counterfactual method) = Second Best, n (summary budget) = 4, overlap (max shared states) = 5, IMeth (importance method) = Last-State. (Hedged sketches of the selection step and the network follow below.)
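
The Table 1 values suggest how a summary might be assembled: score candidate states from the collected traces, pair each with a k-step counterfactual rollout, and keep n comparisons whose trajectories do not share too many states. The snippet below is a minimal sketch of only that selection step, assuming trajectories are given as sequences of hashable state ids; the function name select_summaries, the greedy strategy, and the toy data are illustrative assumptions, not the authors' Algorithm 1.

from typing import Hashable, List, Sequence, Tuple

# Illustrative types: a trajectory is a sequence of state ids, and a candidate
# pairs an importance score with such a trajectory (of length k = 7 in the paper).
Trajectory = Sequence[Hashable]
Candidate = Tuple[float, Trajectory]

def select_summaries(candidates: List[Candidate], budget: int = 4,
                     max_shared_states: int = 5) -> List[Candidate]:
    """Greedily keep the highest-scoring trajectories whose pairwise state
    overlap stays within the allowed maximum (assumed strategy, for illustration)."""
    chosen: List[Candidate] = []
    for score, traj in sorted(candidates, key=lambda c: c[0], reverse=True):
        states = set(traj)
        if all(len(states & set(t)) <= max_shared_states for _, t in chosen):
            chosen.append((score, traj))
        if len(chosen) == budget:
            break
    return chosen

if __name__ == "__main__":
    # Toy state ids only, to show the overlap constraint in action.
    toy = [(0.9, [1, 2, 3, 4, 5, 6, 7]),
           (0.8, [1, 2, 3, 4, 5, 6, 8]),    # shares 6 states with the first, so it is skipped
           (0.7, [10, 11, 12, 13, 14, 15, 16])]
    print(select_summaries(toy, budget=2, max_shared_states=5))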
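The Experiment Setup row also fully specifies the network shape, so it can be restated as code. The sketch below is a minimal PyTorch reconstruction assuming ReLU activations and summed-head action selection (both assumptions; the paper only states the layer sizes, the three heads, and the five actions); it is not the authors' implementation.

import torch
import torch.nn as nn

N_ACTIONS = 5   # left (up), idle, right (down), faster, slower
N_HEADS = 3     # one head per desired behavior / reward component
OBS_SIZE = 25   # 5x5 observation array, flattened

class MultiHeadDDQN(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared trunk: two fully connected hidden layers of width 256
        # (ReLU activations are an assumption, not stated in the paper).
        self.trunk = nn.Sequential(
            nn.Linear(OBS_SIZE, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Three heads, each a linear layer predicting one Q-value per action.
        self.heads = nn.ModuleList([nn.Linear(256, N_ACTIONS) for _ in range(N_HEADS)])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, 25) flattened observation -> (batch, N_HEADS, N_ACTIONS).
        features = self.trunk(obs)
        return torch.stack([head(features) for head in self.heads], dim=1)

    def q_values(self, obs: torch.Tensor) -> torch.Tensor:
        # Overall Q-values for action selection: summing over heads is a common
        # choice with reward decomposition, assumed here for illustration.
        return self.forward(obs).sum(dim=1)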