Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems
Authors: Timothy Arthur Mann, Sven Gowal, Andras Gyorgy, Huiyi Hu, Ray Jiang, Balaji Lakshminarayanan, Prav Srinivasan
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two real-world datasets for predicting human behavior show that RFF outperforms both FF and a direct forecaster that does not make use of the proxy. |
| Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Timothy A. Mann <timothymann@google.com>. |
| Pseudocode | Yes | Algorithm 1 Factored Forecaster for Delayed Outcomes |
| Open Source Code | No | The paper does not provide an explicit statement or a link for open-source code availability. |
| Open Datasets | Yes | We obtained historical information about commits to GitHub repositories from the BigQuery GitHub database. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., percentages, sample counts, or specific predefined split references). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimization algorithms like Stochastic Gradient Descent and Adam optimizer, but does not provide specific software names with version numbers (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x). |
| Experiment Setup | Yes | We update network weights using Stochastic Gradient Descent with a learning rate of 0.1 minimizing the negative log-loss. We apply L2 regularization on the weights with a scale parameter of 0.01. For the experiment explained in Section 4.1, the networks predicting the outcome distribution from proxies use a learning rate of 1. Network towers have two hidden layers with 40 and 20 units. The training buffer has a size of 1,000. We start training once we have 128 examples in the buffer and perform one gradient step with a batch size of 128 every four rounds. For the experiment explained in Section 4.2, the networks predicting the outcome distribution from proxies use a learning rate of 0.1. Network towers have two hidden layers with 20 and 10 units. The training buffer has a size of 3,000. We start training once we have 500 examples in the buffer and perform 20 gradient steps with a batch size of 128 every 1,000 rounds. |
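
The Experiment Setup row quotes enough detail to sketch the Section 4.1 training loop. Below is a minimal, hedged Python/TensorFlow sketch: the hyperparameters (SGD, learning rates 0.1 and 1, L2 scale 0.01, towers with 40 and 20 hidden units, a buffer of 1,000 examples, a warm-up of 128 examples, and one batch-128 gradient step every four rounds) are taken from the table, while `FEATURE_DIM`, `NUM_PROXIES`, `NUM_OUTCOMES`, the one-hot proxy encoding, and the synthetic data stream are illustrative assumptions, not details from the paper.

```python
import collections
import random

import numpy as np
import tensorflow as tf

# Illustrative dimensions; the table does not report them, so these are assumptions.
FEATURE_DIM = 16      # assumption: size of the context/action features
NUM_PROXIES = 4       # assumption: number of discrete proxy values
NUM_OUTCOMES = 3      # assumption: number of discrete outcome values

BUFFER_SIZE = 1_000   # training buffer size (Section 4.1)
WARMUP = 128          # start training once 128 examples are buffered
BATCH_SIZE = 128
STEP_EVERY = 4        # one gradient step every four rounds


def make_tower(num_outputs: int, learning_rate: float) -> tf.keras.Model:
    """Tower with two hidden layers (40 and 20 units) and L2 scale 0.01,
    trained by SGD on the negative log-loss (categorical cross-entropy)."""
    reg = tf.keras.regularizers.l2(0.01)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(40, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(20, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(num_outputs, activation="softmax",
                              kernel_regularizer=reg),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy")
    return model


# Features -> proxy distribution (learning rate 0.1), and
# proxy -> outcome distribution (learning rate 1, as quoted for Section 4.1).
proxy_net = make_tower(NUM_PROXIES, learning_rate=0.1)
outcome_net = make_tower(NUM_OUTCOMES, learning_rate=1.0)

buffer: collections.deque = collections.deque(maxlen=BUFFER_SIZE)


def observe(features: np.ndarray, proxy: int, outcome: int, round_idx: int) -> None:
    """Buffer one example and take a gradient step on the quoted schedule."""
    buffer.append((features, proxy, outcome))
    if len(buffer) >= WARMUP and round_idx % STEP_EVERY == 0:
        batch = random.sample(list(buffer), min(BATCH_SIZE, len(buffer)))
        feats = np.stack([b[0] for b in batch])
        proxies = np.array([b[1] for b in batch])
        outcomes = np.array([b[2] for b in batch])
        proxy_net.train_on_batch(feats, proxies)
        # One-hot proxies as input to the outcome tower (an assumption).
        outcome_net.train_on_batch(np.eye(NUM_PROXIES)[proxies], outcomes)


# Synthetic stream purely for illustration.
for t in range(1, 2_000):
    x_t = np.random.rand(FEATURE_DIM).astype(np.float32)
    z_t = np.random.randint(NUM_PROXIES)
    y_t = np.random.randint(NUM_OUTCOMES)
    observe(x_t, z_t, y_t, t)
```

The Section 4.2 configuration quoted in the same row differs only in the reported hyperparameters (towers of 20 and 10 units, buffer of 3,000, warm-up of 500 examples, 20 gradient steps every 1,000 rounds, and a learning rate of 0.1 for the outcome-from-proxy networks), so the same sketch applies with those constants swapped in.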