SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
Authors: Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, Craig Boutilier
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Sections 5 (Empirical Evaluation: Simulation) and 6 (Empirical Evaluation: Live Experiments) |
| Researcher Affiliation | Collaboration | ¹Google Research; ²Department of Computer Science, University of Texas at Austin |
| Pseudocode | No | The paper presents mathematical equations and descriptions of updates, but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions that TensorFlow software is available, but it does not provide an explicit statement or link for the open-source code of the methodology described in this paper. |
| Open Datasets | No | We construct a simulation environment since most public datasets are point-wise, static, and not designed for evaluating multi-step user-recommender interactions. |
| Dataset Splits | No | The paper describes evaluating strategies on 5000 simulated users but does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions training on 'large-scale recommenders' and using 'distributed training', but it does not provide specific details about the hardware used, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper states that 'The model is trained using TensorFlow', but it does not specify the version number of TensorFlow or other software dependencies. |
| Experiment Setup | Yes | The paper specifies parameters for the simulation environment, such as '|T| = 20, m = 10, k = 3', and describes the training approach in live experiments: 'We train on-policy over pairs of consecutive start page visits, with LTV labels computed using Eq. (14), and use top-k optimization for both training and serving'. |
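To make the reported setup concrete, below is a minimal Python sketch of the simulation configuration and the top-k slate construction mentioned in the table. All class, function, and field names are assumptions for illustration; the paper releases no code, and only the values |T| = 20, m = 10, k = 3, the 5000 simulated users, and the use of top-k optimization are taken from the source.

```python
# Minimal sketch of the reported experiment setup (assumed names/structure).
# Only |T| = 20, m = 10, k = 3, 5000 simulated users, and "top-k optimization"
# come from the paper; everything else is a placeholder for illustration.

from dataclasses import dataclass
import numpy as np


@dataclass
class SimulationConfig:
    num_topics: int = 20      # |T|: number of document topics (assumed meaning)
    num_candidates: int = 10  # m: candidate documents per recommendation event (assumed)
    slate_size: int = 3       # k: number of items shown per slate (assumed)
    num_users: int = 5000     # simulated users evaluated, per the table


def top_k_slate(item_scores: np.ndarray, slate_size: int) -> np.ndarray:
    """Greedy top-k slate construction: pick the k highest-scoring candidates.

    A stand-in for the paper's top-k optimization; the actual SlateQ scores
    combine item-level Q-values with a user choice model, which is not
    reproduced here.
    """
    return np.argsort(-item_scores)[:slate_size]


if __name__ == "__main__":
    cfg = SimulationConfig()
    rng = np.random.default_rng(0)
    scores = rng.random(cfg.num_candidates)       # placeholder per-item scores
    slate = top_k_slate(scores, cfg.slate_size)   # indices of recommended items
    print("Recommended slate (candidate indices):", slate)
```

This sketch only illustrates the scale of the simulation and the shape of top-k serving; reproducing the paper's results would additionally require its user choice model, the LTV labels of Eq. (14), and the on-policy training loop over consecutive start page visits.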