Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems

Authors: Timothy Arthur Mann, Sven Gowal, Andras Gyorgy, Huiyi Hu, Ray Jiang, Balaji Lakshminarayanan, Prav Srinivasan

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on two real-world datasets for predicting human behavior show that RFF outperforms both FF and a direct forecaster that does not make use of the proxy."
Researcher Affiliation | Industry | "DeepMind, London, UK. Correspondence to: Timothy A. Mann <timothymann@google.com>."
Pseudocode | Yes | "Algorithm 1 Factored Forecaster for Delayed Outcomes"
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code.
Open Datasets | Yes | "We obtained historical information about commits to GitHub repositories from the BigQuery GitHub database."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined splits).
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions optimization algorithms such as Stochastic Gradient Descent and Adam, but does not list specific software with version numbers (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x).
Experiment Setup | Yes | "We update network weights using Stochastic Gradient Descent with a learning rate of 0.1, minimizing the negative log-loss. We apply L2 regularization on the weights with a scale parameter of 0.01. For the experiment explained in Section 4.1, the networks predicting the outcome distribution from proxies use a learning rate of 1. Network towers have two hidden layers with 40 and 20 units. The training buffer has a size of 1,000. We start training once we have 128 examples in the buffer and perform one gradient step with a batch size of 128 every four rounds. For the experiment explained in Section 4.2, the networks predicting the outcome distribution from proxies use a learning rate of 0.1. Network towers have two hidden layers with 20 and 10 units. The training buffer has a size of 3,000. We start training once we have 500 examples in the buffer and perform 20 gradient steps with a batch size of 128 every 1,000 rounds."
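The Section 4.1 schedule quoted above (SGD with learning rate 0.1 on L2-regularized negative log-loss, a 1,000-example FIFO buffer, training starting once 128 examples are buffered, one batch-128 gradient step every four rounds) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the logistic model is a stand-in for the paper's two-tower networks, and the data stream and all variable names are hypothetical.

```python
import numpy as np

# Hyperparameters as stated in the quoted Section 4.1 setup.
LR, L2, BUFFER_SIZE = 0.1, 0.01, 1_000
MIN_BUFFER, BATCH, TRAIN_EVERY = 128, 128, 4

rng = np.random.default_rng(0)
dim = 5
w = np.zeros(dim)          # stand-in model: logistic regression weights
buffer_x, buffer_y = [], []

def sgd_step(w, xb, yb):
    """One SGD step on L2-regularized negative log-loss (logistic model)."""
    p = 1.0 / (1.0 + np.exp(-xb @ w))           # sigmoid predictions
    grad = xb.T @ (p - yb) / len(yb) + L2 * w   # log-loss gradient + L2 term
    return w - LR * grad

steps = 0
for round_ in range(1, 2_001):
    # One new (features, outcome) example arrives per round (synthetic here).
    x = rng.normal(size=dim)
    y = float(x[0] + 0.1 * rng.normal() > 0)
    buffer_x.append(x)
    buffer_y.append(y)
    if len(buffer_x) > BUFFER_SIZE:             # FIFO eviction at capacity
        buffer_x.pop(0)
        buffer_y.pop(0)

    # Train only once the buffer is warm, and only every TRAIN_EVERY rounds.
    if len(buffer_x) >= MIN_BUFFER and round_ % TRAIN_EVERY == 0:
        idx = rng.choice(len(buffer_x), size=BATCH)
        w = sgd_step(w, np.array(buffer_x)[idx], np.array(buffer_y)[idx])
        steps += 1

print(steps, len(buffer_x))
```

The Section 4.2 variant differs only in constants (buffer 3,000; warm-up 500; 20 steps of batch 128 every 1,000 rounds) and in the tower sizes, which this sketch does not model.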