Estimating and Penalizing Induced Preference Shifts in Recommender Systems

Authors: Micah D. Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In simulated experiments, we show that our learned preference dynamics model is effective in estimating user preferences and how they would respond to new recommenders. Additionally, we show that recommenders that optimize for staying in the trust region can avoid manipulative behaviors while still generating engagement.
Researcher Affiliation | Academia | ¹Berkeley ²MIT. Correspondence to: <mdc@berkeley.edu>.
Pseudocode | Yes | Algorithm 1: Predicting future user preferences at timestep H under RS policy π; Algorithm 2: Predicting counterfactual user preferences at timestep T under policy π, given t timesteps of interaction data with π; Algorithm 3: Generating a trajectory for RL training and computing metrics. (See the rollout sketch after this table.)
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for their methodology is publicly available.
Open Datasets | No | For all our experiments, we use a dataset of 10k trajectories (each of length 10), collected under a mixture of policies described in Appendix H.3. The paper does not provide access information (link, DOI, etc.) for this generated dataset.
Dataset Splits | Yes | For all our experiments, we use a dataset of 10k trajectories (each of length 10), collected under a mixture of policies described in Appendix H.3. 7.5k trajectories are used for training our models and 2.5k for computing the validation losses and accuracies reported in Sec. 7.1. (See the split sketch below.)
Hardware Specification | Yes | Training runs in less than 30 minutes for each condition on a MacBook Pro 16 (2020).
Software Dependencies | Yes | As Von Mises distributions are not implemented in numpy (Harris et al., 2020), for simplicity we use clipped normal distributions (disregarding probability mass beyond 180° in either direction) in all places except for the transformer output (which uses PyTorch's (Paszke et al., 2019) Von Mises implementation). For RL optimization, we use PPO (Schulman et al., 2017) trained with RLlib (Liang et al., 2018). (See the Von Mises sketch below.)
Experiment Setup | Yes | For our learned human models, we use the BERT transformer architecture (similarly to (Sun et al., 2019)) with 2 layers, 2 attention heads, 4 sets of Von Mises distribution parameters, a learning rate of 0.00003, batch size of 500, and 100 epochs. We use batch size 1200, minibatch size 600, 4 parallel workers, 0.005 learning rate, 50 gradient updates per minibatch per iteration, policy function clipping parameter of 0.5, value function clipping parameter of 50, and loss coefficient of 8, with an LSTM network with 64 cell size. γ = 0 for the myopic training and γ = 0.99 for long-horizon RL training. (See the configuration sketches below.)
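
The sketches below are illustrative reconstructions of a few quoted details, not the authors' code. First, a minimal sketch of the kind of rollout the Pseudocode row lists for Algorithm 1: simulate H steps of interaction under a recommender policy using a learned preference-dynamics model. The names `predict_future_preferences`, `preference_dynamics`, `pi`, and `theta_0` are hypothetical stand-ins, not the paper's API.

```python
# Hedged sketch: roll out H steps under recommender policy `pi`, using a learned
# preference-dynamics model to predict how user preferences shift after each
# recommendation. All names here are hypothetical stand-ins.
def predict_future_preferences(theta_0, pi, preference_dynamics, H):
    theta = theta_0
    trajectory = [theta]
    for t in range(H):
        recommendation = pi(theta, t)                       # recommender's chosen item(s)
        theta = preference_dynamics(theta, recommendation)  # predicted preference shift
        trajectory.append(theta)
    return trajectory
```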
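The 7.5k/2.5k split quoted under Dataset Splits could be produced along these lines; the random shuffling and the seed are assumptions, since the excerpt does not say how the split is made.

```python
import numpy as np

# Assumed random 75/25 split of the 10k trajectories; the quoted excerpt does not
# specify whether the split is random or how it is seeded.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(10_000)
train_idx, val_idx = indices[:7_500], indices[7_500:]
```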
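A minimal sketch of the distribution substitution quoted under Software Dependencies: outside the transformer output, angles are scored with an ordinary normal density while the mass beyond ±180° is simply disregarded, whereas the transformer output uses `torch.distributions.VonMises`. The helper name and parameter values are illustrative, not taken from the paper.

```python
import numpy as np
import torch
from torch.distributions import VonMises

# Clipped-normal stand-in: angles (in degrees) are scored with a normal log-density,
# and the probability mass that would fall beyond +/-180 degrees is disregarded
# (no renormalization). Parameter values are illustrative placeholders.
def clipped_normal_logpdf(angle_deg, loc_deg, scale_deg):
    z = (angle_deg - loc_deg) / scale_deg
    return -0.5 * z**2 - np.log(scale_deg * np.sqrt(2 * np.pi))

log_p_approx = clipped_normal_logpdf(angle_deg=45.0, loc_deg=0.0, scale_deg=30.0)

# The transformer output instead uses PyTorch's Von Mises implementation (radians).
vm = VonMises(loc=torch.tensor(0.0), concentration=torch.tensor(4.0))
log_p_vm = vm.log_prob(torch.deg2rad(torch.tensor(45.0)))
```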
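The transformer hyperparameters quoted under Experiment Setup could be instantiated roughly as follows, using a generic PyTorch encoder as a stand-in for the BERT-style architecture; the model width, feedforward size, and the exact shape of the Von Mises output head are assumptions not stated in the excerpt.

```python
import torch.nn as nn

# Stand-in encoder with the quoted depth and head count: 2 layers, 2 attention heads.
# d_model=64 and dim_feedforward=128 are assumptions not given in the excerpt.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=2, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Output head producing 4 sets of Von Mises parameters (assumed to be one location
# and one concentration per set).
von_mises_head = nn.Linear(64, 4 * 2)

# Quoted optimization settings: learning rate 0.00003, batch size 500, 100 epochs.
LEARNING_RATE, BATCH_SIZE, NUM_EPOCHS = 3e-5, 500, 100
```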
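Finally, a hedged mapping of the quoted RL settings onto RLlib-style PPO config keys. The key names reflect my reading of which parameter each number corresponds to (for example, taking "loss coefficient of 8" to be the value-function loss coefficient), and the environment id is a placeholder, not from the paper.

```python
# Hedged RLlib-style PPO configuration; key assignments are interpretations of the
# quoted hyperparameters, and "RecommenderEnv" is a hypothetical environment id.
ppo_config = {
    "env": "RecommenderEnv",            # placeholder environment id
    "train_batch_size": 1200,           # "batch size 1200"
    "sgd_minibatch_size": 600,          # "minibatch size 600"
    "num_workers": 4,                   # "4 parallel workers"
    "lr": 0.005,                        # "0.005 learning rate"
    "num_sgd_iter": 50,                 # "50 gradient updates per minibatch per iteration"
    "clip_param": 0.5,                  # "policy function clipping parameter of 0.5"
    "vf_clip_param": 50,                # "value function clipping parameter of 50"
    "vf_loss_coeff": 8,                 # assumed reading of "loss coefficient of 8"
    "gamma": 0.99,                      # 0.99 for long-horizon RL; 0 for the myopic baseline
    "model": {"use_lstm": True, "lstm_cell_size": 64},
}

# e.g., with RLlib's classic trainer API:
# from ray.rllib.agents.ppo import PPOTrainer
# trainer = PPOTrainer(config=ppo_config)
```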