Incorporating Behavioral Constraints in Online AI Systems
Authors: Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, Francesca Rossi
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We characterize the upper bound on the expected regret of the contextual bandit algorithm that underlies our agent and provide a case study with real world data in two application domains. Our experiments show that the designed agent is able to act within the set of behavior constraints without significantly degrading its overall reward performance. |
| Researcher Affiliation | Collaboration | IBM Research Yorktown Heights, NY, USA {avinash.bala,djallel.bouneffouf,francesca.rossi2}@ibm.com Tulane University New Orleans, LA, USA nsmattei@tulane.edu |
| Pseudocode | Yes | Algorithm 1 Contextual Thompson Sampling Algorithm Algorithm 2 Behavior Constrained Thompson Sampling |
| Open Source Code | No | The paper does not provide concrete access to source code or explicitly state that it is open-sourced or available. |
| Open Datasets | Yes | We start from the MovieLens 20M dataset (Harper and Konstan 2016), which contains 20 million ratings of 27,000 movies by 138,000 users along with genre information. |
| Dataset Splits | Yes | For each of these experiments, we show the means over 5 cross validations using a different subset of 200 movies to train µe each time. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any ancillary software or library names with version numbers required for replication. |
| Experiment Setup | Yes | To give flexibility to our agent, we let the system designer decide how much the guidelines given by the behavioral constraints should weigh on the agent's decisions during the online phase. To control the tradeoff between following the learned behavioral constraints and pursuing a greedy online-only policy, we expose a parameter of the algorithm called σonline. This parameter allows the system designer to smoothly transition between the two policy extremes, where σonline = 0.0 means that we only follow the learned constraints and are insensitive to the online reward, while σonline = 1.0 means we only follow the online rewards and give no weight to the learned constraints. |
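
The pseudocode and experiment-setup rows above describe a contextual Thompson Sampling agent whose arm scores are blended with offline-learned constraint scores via σonline. Below is a minimal sketch of that idea; the class name, the `constraint_scores` input, and the linear blending rule are illustrative assumptions, not the paper's exact Algorithm 2.

```python
import numpy as np

class BlendedThompsonSampling:
    """Contextual Thompson Sampling with a sigma_online tradeoff between the
    sampled online reward estimate and an offline-learned constraint score.
    NOTE: a sketch under stated assumptions, not the paper's implementation."""

    def __init__(self, n_arms, dim, sigma_online=0.5, v=0.25):
        self.n_arms = n_arms
        self.dim = dim
        self.sigma_online = sigma_online  # 1.0 = online reward only, 0.0 = constraints only
        self.v = v                        # scale of the posterior sample (exploration)
        # Per-arm Bayesian linear-regression state, as in contextual Thompson Sampling.
        self.B = [np.eye(dim) for _ in range(n_arms)]
        self.f = [np.zeros(dim) for _ in range(n_arms)]

    def select_arm(self, context, constraint_scores):
        """context: (dim,) feature vector; constraint_scores: (n_arms,) offline scores."""
        blended = np.empty(self.n_arms)
        for a in range(self.n_arms):
            mu_hat = np.linalg.solve(self.B[a], self.f[a])
            cov = self.v ** 2 * np.linalg.inv(self.B[a])
            mu_tilde = np.random.multivariate_normal(mu_hat, cov)  # posterior sample
            online_score = context @ mu_tilde
            blended[a] = (self.sigma_online * online_score
                          + (1.0 - self.sigma_online) * constraint_scores[a])
        return int(np.argmax(blended))

    def update(self, arm, context, reward):
        # Standard rank-one update of the chosen arm's regression statistics.
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context
```

Setting `sigma_online=0.0` reproduces the constraints-only extreme and `sigma_online=1.0` the online-only extreme described in the experiment-setup row.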
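For the dataset and split rows, the sketch below shows one plausible way to load the MovieLens 20M files and repeat the evaluation over five random subsets of 200 movies; the file paths, the sampling scheme, and the train/eval division are assumptions meant only to mirror the protocol quoted above.

```python
import pandas as pd

# MovieLens 20M ships ratings.csv (userId, movieId, rating, timestamp)
# and movies.csv (movieId, title, genres).
ratings = pd.read_csv("ml-20m/ratings.csv")
movies = pd.read_csv("ml-20m/movies.csv")

for seed in range(5):  # five repetitions, each with a different 200-movie training subset
    train_movies = movies.sample(n=200, random_state=seed)
    train_ratings = ratings[ratings.movieId.isin(train_movies.movieId)]
    eval_ratings = ratings[~ratings.movieId.isin(train_movies.movieId)]
    # ... fit the behavioral-constraint model on train_ratings,
    # then evaluate the online agent on eval_ratings.
```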