Incorporating Behavioral Constraints in Online AI Systems
Authors: Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, Francesca Rossi
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We characterize the upper bound on the expected regret of the contextual bandit algorithm that underlies our agent and provide a case study with real world data in two application domains. Our experiments show that the designed agent is able to act within the set of behavior constraints without significantly degrading its overall reward performance. |
| Researcher Affiliation | Collaboration | IBM Research Yorktown Heights, NY, USA {avinash.bala,djallel.bouneffouf,francesca.rossi2}@ibm.com Tulane University New Orleans, LA, USA nsmattei@tulane.edu |
| Pseudocode | Yes | Algorithm 1 Contextual Thompson Sampling Algorithm Algorithm 2 Behavior Constrained Thompson Sampling |
| Open Source Code | No | The paper does not provide concrete access to source code or explicitly state that it is open-sourced or available. |
| Open Datasets | Yes | We start from the MovieLens 20M dataset (Harper and Konstan 2016), which contains 20 million ratings of 27,000 movies by 138,000 users along with genre information. |
| Dataset Splits | Yes | For each of these experiments, we show the means over 5 cross validations using a different subset of 200 movies to train µe each time. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any ancillary software or library names with version numbers required for replication. |
| Experiment Setup | Yes | To give flexibility to our agent, we let the system designer decide how much the guidelines given by the behavioral constraints should weigh on the agent's decisions during the online phase. To control the tradeoff between following the learned behavioral constraints and pursuing a greedy online-only policy, we expose a parameter of the algorithm called σonline. This parameter allows the system designer to smoothly transition between the two policy extremes, where σonline = 0.0 means that we only follow the learned constraints and are insensitive to the online reward, while σonline = 1.0 means we only follow the online rewards and give no weight to the learned constraints. |
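
The pseudocode and experiment-setup rows above describe a contextual Thompson Sampling agent whose arm scores are blended with offline-learned constraint scores via σonline. Below is a minimal sketch of that idea; the class name, the `constraint_scores` input, and the linear blending rule are illustrative assumptions, not the paper's exact Algorithm 2.

```python
import numpy as np

class BlendedThompsonSampling:
    """Contextual Thompson Sampling with a sigma_online tradeoff between the
    sampled online reward estimate and an offline-learned constraint score.
    NOTE: a sketch under stated assumptions, not the paper's implementation."""

    def __init__(self, n_arms, dim, sigma_online=0.5, v=0.25):
        self.n_arms = n_arms
        self.dim = dim
        self.sigma_online = sigma_online  # 1.0 = online reward only, 0.0 = constraints only
        self.v = v                        # scale of the posterior sample (exploration)
        # Per-arm Bayesian linear-regression state, as in contextual Thompson Sampling.
        self.B = [np.eye(dim) for _ in range(n_arms)]
        self.f = [np.zeros(dim) for _ in range(n_arms)]

    def select_arm(self, context, constraint_scores):
        """context: (dim,) feature vector; constraint_scores: (n_arms,) offline scores."""
        blended = np.empty(self.n_arms)
        for a in range(self.n_arms):
            mu_hat = np.linalg.solve(self.B[a], self.f[a])
            cov = self.v ** 2 * np.linalg.inv(self.B[a])
            mu_tilde = np.random.multivariate_normal(mu_hat, cov)  # posterior sample
            online_score = context @ mu_tilde
            blended[a] = (self.sigma_online * online_score
                          + (1.0 - self.sigma_online) * constraint_scores[a])
        return int(np.argmax(blended))

    def update(self, arm, context, reward):
        # Standard rank-one update of the chosen arm's regression statistics.
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context
```

Setting `sigma_online=0.0` reproduces the constraints-only extreme and `sigma_online=1.0` the online-only extreme described in the experiment-setup row.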
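For the dataset and split rows, the sketch below shows one plausible way to load the MovieLens 20M files and repeat the evaluation over five random subsets of 200 movies; the file paths, the sampling scheme, and the train/eval division are assumptions meant only to mirror the protocol quoted above.

```python
import pandas as pd

# MovieLens 20M ships ratings.csv (userId, movieId, rating, timestamp)
# and movies.csv (movieId, title, genres).
ratings = pd.read_csv("ml-20m/ratings.csv")
movies = pd.read_csv("ml-20m/movies.csv")

for seed in range(5):  # five repetitions, each with a different 200-movie training subset
    train_movies = movies.sample(n=200, random_state=seed)
    train_ratings = ratings[ratings.movieId.isin(train_movies.movieId)]
    eval_ratings = ratings[~ratings.movieId.isin(train_movies.movieId)]
    # ... fit the behavioral-constraint model on train_ratings,
    # then evaluate the online agent on eval_ratings.
```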