Estimating and Penalizing Induced Preference Shifts in Recommender Systems
Authors: Micah D. Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In simulated experiments, we show that our learned preference dynamics model is effective in estimating user preferences and how they would respond to new recommenders. Additionally, we show that recommenders that optimize for staying in the trust region can avoid manipulative behaviors while still generating engagement. |
| Researcher Affiliation | Academia | ¹Berkeley, ²MIT. Correspondence to: <mdc@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1: Predicting future user preferences at timestep H under RS policy π; Algorithm 2: Predicting counterfactual user preferences at timestep T under policy π, given t timesteps of interaction data with π; Algorithm 3: Generating a trajectory for RL training and computing metrics. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for their methodology is publicly available. |
| Open Datasets | No | For all our experiments, we use a dataset of 10k trajectories (each of length 10), collected under a mixture of policies described in Appendix H.3. The paper does not provide access information (link, DOI, etc.) for this generated dataset. |
| Dataset Splits | Yes | For all our experiments, we use a dataset of 10k trajectories (each of length 10), collected under a mixture of policies described in Appendix H.3. 7.5k trajectories are used for training our models and 2.5k for computing the validation losses and accuracies reported in Sec. 7.1. |
| Hardware Specification | Yes | Training runs in less than 30 minutes for each condition on a MacBook Pro 16 (2020). |
| Software Dependencies | Yes | As Von Mises distributions are not implemented in numpy (Harris et al., 2020), for simplicity we use clipped normal distributions (disregarding probability mass beyond 180° in either direction) in all places except for the transformer output (which uses PyTorch's (Paszke et al., 2019) Von Mises implementation). For RL optimization, we use PPO (Schulman et al., 2017) trained with RLlib (Liang et al., 2018). (See the distribution sketch below the table.) |
| Experiment Setup | Yes | For our learned human models, we use the BERT transformer architecture (similarly to Sun et al., 2019) with 2 layers, 2 attention heads, 4 sets of Von Mises distribution parameters, a learning rate of 0.00003, batch size of 500, and 100 epochs. For RL training, we use batch size 1200, minibatch size 600, 4 parallel workers, 0.005 learning rate, 50 gradient updates per minibatch per iteration, policy function clipping parameter of 0.5, value function clipping parameter of 50 and loss coefficient of 8, with an LSTM network with cell size 64. γ = 0 for myopic training and γ = 0.99 for long-horizon RL training. (See the model and PPO configuration sketches below the table.) |
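The distribution detail quoted in the Software Dependencies row can be made concrete with a short sketch. The code below is a hypothetical illustration, not the authors' implementation: `von_mises_log_prob` uses PyTorch's `VonMises` distribution (which the paper does reference), while `clipped_normal_log_prob` follows one possible reading of the "clipped normal" description, treating any angle more than 180° from the mean as having no probability; the function names and the degree-based parameterization are assumptions.

```python
import numpy as np
import torch
from torch.distributions import VonMises

def von_mises_log_prob(angle_rad, loc, concentration):
    # PyTorch's Von Mises implementation handles angular data directly
    # (inputs are tensors of angles in radians).
    return VonMises(loc, concentration).log_prob(angle_rad)

def clipped_normal_log_prob(angle_deg, loc_deg, scale_deg):
    # Plain normal log-density over angles in degrees; mass beyond 180 degrees
    # of the mean is disregarded by assigning those angles -inf log-probability.
    diff = np.asarray(angle_deg, dtype=float) - loc_deg
    log_pdf = -0.5 * (diff / scale_deg) ** 2 - np.log(scale_deg * np.sqrt(2.0 * np.pi))
    return np.where(np.abs(diff) <= 180.0, log_pdf, -np.inf)
```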
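The transformer hyperparameters in the Experiment Setup row map onto a minimal PyTorch sketch like the one below. This is an approximation under stated assumptions: the paper uses a BERT architecture, whereas this sketch substitutes `nn.TransformerEncoder`, and the hidden size, input featurization, and class name are placeholders. Only the layer count (2), attention heads (2), number of Von Mises parameter sets (4), learning rate (3e-5), batch size (500), and epoch count (100) come from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import VonMises

class PreferenceDynamicsModel(nn.Module):
    """Hypothetical stand-in for the BERT-style preference-dynamics model."""

    def __init__(self, d_model: int = 64, n_components: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # 2 layers, 2 heads
        self.head = nn.Linear(d_model, 2 * n_components)            # (loc, concentration) per set

    def forward(self, interaction_history: torch.Tensor) -> VonMises:
        # interaction_history: (batch, timesteps, d_model) featurized interactions
        h = self.encoder(interaction_history)[:, -1]                # last-step summary
        loc, log_conc = self.head(h).chunk(2, dim=-1)
        return VonMises(loc, log_conc.exp())                        # 4 sets of Von Mises parameters

model = PreferenceDynamicsModel()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)           # reported learning rate
# Training would iterate for 100 epochs with batch size 500 (data loading omitted).
```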
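Similarly, the reported PPO settings can be collected into RLlib's classic dict-style configuration. The key names below follow RLlib's standard PPO options; the mapping from the quoted description (e.g. "loss coefficient of 8" to `vf_loss_coeff`) is an assumption, and the environment is omitted because the paper's simulated environment is not released.

```python
# Hypothetical mapping of the reported PPO hyperparameters onto RLlib config keys.
ppo_config = {
    "train_batch_size": 1200,
    "sgd_minibatch_size": 600,
    "num_workers": 4,                  # parallel rollout workers
    "lr": 0.005,
    "num_sgd_iter": 50,                # gradient updates per minibatch per iteration
    "clip_param": 0.5,                 # policy clipping parameter
    "vf_clip_param": 50,               # value function clipping parameter
    "vf_loss_coeff": 8,                # assumed reading of "loss coefficient of 8"
    "gamma": 0.99,                     # 0.0 for the myopic training condition
    "model": {"use_lstm": True, "lstm_cell_size": 64},
}
# This dict would be passed to RLlib's PPO trainer together with an environment,
# e.g. ray.tune.run("PPO", config={**ppo_config, "env": ...}) in the classic API.
```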