Exponential Smoothing for Off-Policy Learning
Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the relevance of our approach and its favorable performance through a set of learning tasks. Finally, we show in Section 6 that our approach enjoys favorable performance. We briefly present our experiments. More details and discussions can be found in Appendix D. We consider the standard supervised-to-bandit conversion (Agarwal et al., 2014) where we transform a supervised training set S_n^TR into logged bandit data D_n as described in Algorithm 1 in Appendix D.1. Here the action space A is the label set and the context space X is the input space. Then, D_n is used to train our policies. After that, we evaluate the reward of the learned policies on the supervised test set S_{n_TS}^TS as described in Algorithm 2 in Appendix D.1. Roughly speaking, the resulting reward quantifies the ability of the learned policy to predict the true labels of the inputs in the test set. This is our performance metric; the higher the better. We use 4 image classification datasets MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR100 (Krizhevsky et al., 2009). (A sketch of this supervised-to-bandit conversion is given below the table.) |
| Researcher Affiliation | Collaboration | 1 CREST, ENSAE, IP Paris, France; 2 Criteo AI Lab, Paris, France. |
| Pseudocode | Yes | Algorithm 1 (Supervised-to-bandit: creating logged data); Algorithm 2 (Supervised-to-bandit: testing policies) |
| Open Source Code | No | The paper states 'Refer to Appendix D to reproduce our experiments.' but does not provide a specific repository link or explicit statement about the release of source code for the methodology. |
| Open Datasets | Yes | We use 4 image classification datasets MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper mentions using a 'training set' and 'test set' and that 'µ0 are learned on the 5% portion of data' for the logging policy, but it does not specify a separate validation dataset split or a methodology for creating one for the main policy training. |
| Hardware Specification | No | The paper describes using a 'ResNet-50 network' for feature extraction but does not provide specific hardware details like GPU/CPU models or other computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' as an optimizer, but it does not provide specific version numbers for any software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | The policies are trained using Adam (Kingma & Ba, 2014) with a learning rate of 0.1 for 20 epochs. In all our experiments, we set S = 32. Here we fix τ = 1/n^{1/4} ≈ 0.06 and α = 1 − 1/n^{1/4} ≈ 0.94. (A sketch using these values in a smoothed importance-weighting objective follows the table.) |
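The supervised-to-bandit conversion quoted in the Research Type row (Algorithms 1 and 2 in Appendix D.1 of the paper) can be summarized in a few lines. The sketch below is a minimal illustration under our own assumptions, not the authors' code: the function names, the binary reward, and the `logging_policy` interface are choices made here for clarity.

```python
import numpy as np

def supervised_to_bandit(X, y, logging_policy, seed=0):
    """Turn a supervised dataset (X, y) into logged bandit data.

    Each input x becomes a context, the logging policy mu_0 samples an
    action (a candidate label), and the reward is 1 if that action equals
    the true label, else 0.  `logging_policy(x)` is assumed to return a
    probability vector over the label set.
    """
    rng = np.random.default_rng(seed)
    logged = []
    for x, label in zip(X, y):
        probs = logging_policy(x)                  # mu_0(. | x)
        a = int(rng.choice(len(probs), p=probs))   # logged action
        reward = float(a == label)                 # binary reward
        logged.append((x, a, reward, probs[a]))    # keep the propensity mu_0(a | x)
    return logged

def test_reward(X_test, y_test, policy):
    """Average reward of a learned policy on the supervised test set:
    the fraction of inputs whose predicted label matches the true label."""
    predictions = [int(np.argmax(policy(x))) for x in X_test]
    return float(np.mean([p == t for p, t in zip(predictions, y_test)]))
```

The returned tuples (context, action, reward, propensity) are exactly what importance-weighted off-policy objectives consume, and `test_reward` matches the report's description of the performance metric (higher is better).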
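The Experiment Setup row fixes τ = 1/n^{1/4} ≈ 0.06 and α = 1 − 1/n^{1/4} ≈ 0.94. As a rough illustration of how the exponent α enters an exponentially smoothed importance-weighting estimator, here is a short sketch under our reading that the logged propensity is raised to the power α; the function name, the toy data, and the exact estimator form are assumptions, not the authors' implementation, and τ is only computed for reference.

```python
import numpy as np

def smoothed_ips_value(policy_probs, rewards, propensities, alpha):
    """Exponentially smoothed IPS estimate of a policy's value (sketch).

    policy_probs[i] = pi(a_i | x_i) for the logged action a_i, rewards[i]
    is the logged reward, and propensities[i] = mu_0(a_i | x_i).  The logged
    propensity is raised to the power alpha in [0, 1]; alpha = 1 recovers
    standard IPS, while smaller alpha damps large importance weights.
    """
    weights = policy_probs / np.power(propensities, alpha)
    return float(np.mean(weights * rewards))

# The excerpt fixes tau = 1/n^{1/4} (~0.06) and alpha = 1 - 1/n^{1/4}
# (~0.94), with n the size of the logged data; only alpha is used below.
n = 60_000                     # illustrative logged-data size, not from the paper
tau = n ** -0.25               # ~0.06
alpha = 1.0 - n ** -0.25       # ~0.94

# Toy logged data, purely for illustration.
rng = np.random.default_rng(0)
propensities = rng.uniform(0.05, 1.0, size=n)
policy_probs = rng.uniform(0.0, 1.0, size=n)
rewards = rng.integers(0, 2, size=n).astype(float)
print(smoothed_ips_value(policy_probs, rewards, propensities, alpha))
```

In practice this estimate (or a penalized variant of it) would be maximized over the policy's parameters with the optimizer quoted above (Adam, learning rate 0.1, 20 epochs); that training loop is omitted here.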