Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ASP-Driven Emergency Planning for Norm Violations in Reinforcement Learning

Authors: Sebastian Adam, Thomas Eiter

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we showcase the practical applicability of the framework in multiple domains, including deterministic and nondeterministic environments. For evaluation, we compare the norm compliance of using an RL policy π (which is unaware of additional norms) against the use of π in combination with our framework. In particular, we are interested in the effect of the horizon k on norm violations.
Researcher Affiliation Academia Institute of Logic and Computation, Vienna University of Technology Favoritenstraße 9-11, A-1040 Vienna, Austria EMAIL
Pseudocode Yes
Listing 1: Excerpt of Pgen
1 plant(4,4).
2 player(0,2,6).
3 p_n(T) | p_s(T) | p_w(T) | p_e(T) :- T = 1..k.
4 player(T,C',R) :- player(T-1,C,R), p_w(T), C' = C-1.
...
8 :~ player(T,C,R), plant(C,R). [1@2]
9 #maximize {R@1,T : reward(R,A,T)}.
Listing 2: Excerpt of Pcheck
10 frog(0,6,2).
11 f_n(T) | f_s(T) | f_w(T) | f_e(T) :- T = 1..k.
12 frog(T,C',R) :- frog(T-1,C,R), f_w(T), C' = C-1.
...
16 ok(0).
17 ok(T) :- player(T,C,R), frog(T,C',R'), C' != C, ok(T-1).
...
19 :- not sat.
20 sat :- ok(k).
21 f_n(T) :- T = 1..k, sat.
...
25 sat :- frog(T,C,R), wall(C,R).
26 good(T) | bad(T) :- T = 1..k.
27 ok(T) :- bad(T).
28 #minimize {1@3,T : bad(T)}.
Open Source Code Yes Git https://github.com/S3basuchian/emergency-planning
Open Datasets Yes We extended the Berkeley AI code for Pacman (DeNero, Klein, and Abbeel 2014) and use the associated instances: small Classic (grid size 20×7; 2 ghosts), medium Classic (20×11; 2 ghosts) and original Classic (28×27; 4 ghosts).
Dataset Splits No The paper describes generating '80 instances of Gardener' and testing '1000 times' for Pacman, but does not provide specific train/test/validation dataset splits (e.g., percentages, sample counts, or explicit split files) for its experiments.
Hardware Specification Yes The experiments were run on a Linux server with two Intel Xeon CPU E5-2650 v4 (12 cores @ 2.20GHz, no hyperthreading) and 256GB RAM.
Software Dependencies Yes We use the ASP solver clingo (version 5.7.1) to solve the ASP program of the framework. Further code and the RL implementations are in Python. Using the state-of-the-art CCN simulator ndnSIM 2.9 (Mastorakis, Afanasyev, and Zhang 2017), we simulate this network for 30 seconds on 5 instances differing in the type and frequency of consumer requests, while observing the cache (size 18) of a router that is a bottleneck for all traffic.
Experiment Setup Yes We use utility-based policy fixing to prevent the agent from getting stuck due to unsolvability (i.e., when no strict k-policy fix is found) and to ease achieving the goal. We use the ASP solver clingo (version 5.7.1) to solve the ASP program of the framework. Further code and the RL implementations are in Python. The experiments were run on a Linux server with two Intel Xeon CPU E5-2650 v4 (12 cores @ 2.20GHz, no hyperthreading) and 256GB RAM. In the policy fix, the penalty for killing a living entity is higher than any action reward. Plans admitting repeated states are penalized to discourage looping in norm-compliant trajectories. An extra reward for reaching the target is given, and rewards for obeying the policy preference increase on cell revisits. Earlier violations and norm adherence are weighted higher to discourage immediate violations or policy deviations.
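The interplay the setup describes — run the norm-unaware policy π, and switch to a k-step emergency plan when a norm violation is foreseeable within horizon k — can be sketched in Python. Everything in this sketch is hypothetical and only illustrative: the grid, the "do not step on a plant" norm, the greedy stand-in policy, and a BFS planner standing in for the paper's ASP-based plan computation with clingo.

```python
from collections import deque

# Hypothetical toy domain (not the paper's Gardener instances):
# a 7x7 grid with one protected plant cell and a fixed goal.
GRID_W, GRID_H = 7, 7
PLANTS = {(3, 0)}            # norm: never enter a plant cell
GOAL = (6, 6)
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]

def greedy_policy(pos):
    """Norm-unaware stand-in for the RL policy: move greedily toward GOAL."""
    x, y = pos
    if x != GOAL[0]:
        return (x + (1 if GOAL[0] > x else -1), y)
    return (x, y + (1 if GOAL[1] > y else -1))

def violates_within(pos, k):
    """Would following the policy for k steps violate the norm?"""
    for _ in range(k):
        pos = greedy_policy(pos)
        if pos in PLANTS:
            return True
    return False

def emergency_plan(pos, k):
    """BFS for a norm-compliant plan of at most k steps
    (the framework computes such a k-policy fix with ASP instead)."""
    queue, seen = deque([(pos, [])]), {pos}
    while queue:
        cur, path = queue.popleft()
        if cur == GOAL or len(path) == k:
            return path
        for dx, dy in MOVES:
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < GRID_W and 0 <= nxt[1] < GRID_H
                    and nxt not in PLANTS and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return []                # no strict k-plan found

def run(start, k, max_steps=50):
    """Execute the policy, overriding it with an emergency plan on demand;
    if no plan exists, fall back to the policy (cf. utility-based fixing)."""
    pos, violations, steps = start, 0, 0
    while pos != GOAL and steps < max_steps:
        plan = emergency_plan(pos, k) if violates_within(pos, k) else []
        for nxt in plan or [greedy_policy(pos)]:
            pos, steps = nxt, steps + 1
            violations += nxt in PLANTS
            if pos == GOAL or steps >= max_steps:
                break
    return pos, violations
```

With these toy definitions, the bare policy would walk through the plant at (3, 0), while the combined agent detours around it and reaches the goal without violations; larger k lets the agent foresee (and avoid) violations earlier, mirroring the horizon study in the evaluation.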