Exponential Smoothing for Off-Policy Learning
Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the relevance of our approach and its favorable performance through a set of learning tasks. Finally, we show in Section 6 that our approach enjoys favorable performance. We briefly present our experiments. More details and discussions can be found in Appendix D. We consider the standard supervised-to-bandit conversion (Agarwal et al., 2014) where we transform a supervised training set S_n^TR into logged bandit data D_n as described in Algorithm 1 in Appendix D.1. Here the action space A is the label set and the context space X is the input space. Then, D_n is used to train our policies. After that, we evaluate the reward of the learned policies on the supervised test set S_{n_TS}^TS as described in Algorithm 2 in Appendix D.1. Roughly speaking, the resulting reward quantifies the ability of the learned policy to predict the true labels of the inputs in the test set. This is our performance metric; the higher the better. We use 4 image classification datasets MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR100 (Krizhevsky et al., 2009). (A sketch of this supervised-to-bandit conversion is given below the table.) |
| Researcher Affiliation | Collaboration | 1 CREST, ENSAE, IP Paris, France; 2 Criteo AI Lab, Paris, France. |
| Pseudocode | Yes | Algorithm 1 (Supervised-to-bandit: creating logged data); Algorithm 2 (Supervised-to-bandit: testing policies) |
| Open Source Code | No | The paper states 'Refer to Appendix D to reproduce our experiments.' but does not provide a specific repository link or explicit statement about the release of source code for the methodology. |
| Open Datasets | Yes | We use 4 image classification datasets MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper mentions using a 'training set' and 'test set' and that 'µ0 are learned on the 5% portion of data' for the logging policy, but it does not specify a separate validation dataset split or a methodology for creating one for the main policy training. |
| Hardware Specification | No | The paper describes using a 'ResNet-50 network' for feature extraction but does not provide specific hardware details like GPU/CPU models or other computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' as an optimizer, but it does not provide specific version numbers for any software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | The policies are trained using Adam (Kingma & Ba, 2014) with a learning rate of 0.1 for 20 epochs. In all our experiments, we set S = 32. Here we fix τ = 1/n^{1/4} ≈ 0.06 and α = 1 − 1/n^{1/4} ≈ 0.94. (A sketch using these values in a smoothed importance-weighting objective follows the table.) |
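The supervised-to-bandit conversion quoted in the Research Type row (Algorithms 1 and 2 in Appendix D.1 of the paper) can be summarized in a few lines. The sketch below is a minimal illustration under our own assumptions, not the authors' code: the function names, the binary reward, and the `logging_policy` interface are choices made here for clarity.

```python
import numpy as np

def supervised_to_bandit(X, y, logging_policy, seed=0):
    """Turn a supervised dataset (X, y) into logged bandit data.

    Each input x becomes a context, the logging policy mu_0 samples an
    action (a candidate label), and the reward is 1 if that action equals
    the true label, else 0.  `logging_policy(x)` is assumed to return a
    probability vector over the label set.
    """
    rng = np.random.default_rng(seed)
    logged = []
    for x, label in zip(X, y):
        probs = logging_policy(x)                  # mu_0(. | x)
        a = int(rng.choice(len(probs), p=probs))   # logged action
        reward = float(a == label)                 # binary reward
        logged.append((x, a, reward, probs[a]))    # keep the propensity mu_0(a | x)
    return logged

def test_reward(X_test, y_test, policy):
    """Average reward of a learned policy on the supervised test set:
    the fraction of inputs whose predicted label matches the true label."""
    predictions = [int(np.argmax(policy(x))) for x in X_test]
    return float(np.mean([p == t for p, t in zip(predictions, y_test)]))
```

The returned tuples (context, action, reward, propensity) are exactly what importance-weighted off-policy objectives consume, and `test_reward` matches the report's description of the performance metric (higher is better).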
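The Experiment Setup row fixes τ = 1/n^{1/4} ≈ 0.06 and α = 1 − 1/n^{1/4} ≈ 0.94. As a rough illustration of how the exponent α enters an exponentially smoothed importance-weighting estimator, here is a short sketch under our reading that the logged propensity is raised to the power α; the function name, the toy data, and the exact estimator form are assumptions, not the authors' implementation, and τ is only computed for reference.

```python
import numpy as np

def smoothed_ips_value(policy_probs, rewards, propensities, alpha):
    """Exponentially smoothed IPS estimate of a policy's value (sketch).

    policy_probs[i] = pi(a_i | x_i) for the logged action a_i, rewards[i]
    is the logged reward, and propensities[i] = mu_0(a_i | x_i).  The logged
    propensity is raised to the power alpha in [0, 1]; alpha = 1 recovers
    standard IPS, while smaller alpha damps large importance weights.
    """
    weights = policy_probs / np.power(propensities, alpha)
    return float(np.mean(weights * rewards))

# The excerpt fixes tau = 1/n^{1/4} (~0.06) and alpha = 1 - 1/n^{1/4}
# (~0.94), with n the size of the logged data; only alpha is used below.
n = 60_000                     # illustrative logged-data size, not from the paper
tau = n ** -0.25               # ~0.06
alpha = 1.0 - n ** -0.25       # ~0.94

# Toy logged data, purely for illustration.
rng = np.random.default_rng(0)
propensities = rng.uniform(0.05, 1.0, size=n)
policy_probs = rng.uniform(0.0, 1.0, size=n)
rewards = rng.integers(0, 2, size=n).astype(float)
print(smoothed_ips_value(policy_probs, rewards, propensities, alpha))
```

In practice this estimate (or a penalized variant of it) would be maximized over the policy's parameters with the optimizer quoted above (Adam, learning rate 0.1, 20 epochs); that training loop is omitted here.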