Learning from a Learner

Authors: Alexis Jacq, Matthieu Geist, Ana Paiva, Olivier Pietquin

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the genericity of our method by observing agents implementing various reinforcement learning algorithms. Finally, we show that, on both discrete and continuous state/action tasks, the observer's performance (that optimizes the recovered reward) can surpass those of the observed learner.
Researcher Affiliation | Collaboration | ¹ Google Brain, Paris, France; ² INESC-ID, IST, University of Lisbon.
Pseudocode | Yes | Algorithm 1: Recovering trajectory-consistent reward (see the illustrative sketch below the table).
Open Source Code | No | The paper does not provide a specific link or explicit statement about the release of its source code.
Open Datasets | Yes | To evaluate how our approach holds when dealing with large dimensions, we use the same experimental setting on continuous control tasks taken from the OpenAI Gym benchmark suite (Brockman et al., 2016). (See the Gym example below the table.)
Dataset Splits | No | The paper describes training procedures and parameters, but it does not specify explicit training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies | No | The paper mentions software components and algorithms such as OpenAI Gym, PPO, and Adam gradient descent, but it does not specify their version numbers for reproducibility.
Experiment Setup | Yes | We use a discount factor γ = 0.96 and a trade-off factor α = 0.3. ... we use Adam gradient descent (Kingma & Ba, 2014) with learning rate 1e-3. ... The algorithm is run by modelling SPI with α_model = 0.7. We use 1000 steps for the policy regressions, 100 steps for the KL divergence regressions, 3000 steps for the reward initialization and 1000 steps for the reward consistency regression. (See the configuration sketch below the table.)
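
The paper's Algorithm 1 is only named in the table above, so the sketch below is a rough illustration rather than the authors' method: it regresses a small reward network so that its predictions agree with a scaled log-ratio of two successive observed policies along sampled transitions. The PyTorch framework, the network architecture, and the exact regression target are all assumptions made for illustration.

```python
# Illustrative only: fit a reward model that is consistent with the improvement
# direction implied by two successive observed policies. The regression target
# (a scaled log-ratio of successive policies) is an assumed simplification, not
# the paper's exact Algorithm 1.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def fit_reward(reward_net, batches, alpha=0.3, lr=1e-3, steps=1000):
    """Regress r(s, a) toward alpha * (log pi_{t+1}(a|s) - log pi_t(a|s)).

    `batches` is a hypothetical iterable yielding (state, action, logp_new,
    logp_old) tensors collected from the observed learner's trajectories.
    """
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _, (state, action, logp_new, logp_old) in zip(range(steps), batches):
        target = alpha * (logp_new - logp_old)   # consistency target (assumed form)
        loss = ((reward_net(state, action) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_net
```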
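
The continuous-control experiments draw on the OpenAI Gym benchmark suite (Brockman et al., 2016). The excerpt above does not name specific environments, so the environment below is a placeholder; the loop only shows how an observer could collect trajectories of a learner's behaviour with the classic (pre-0.26) Gym reset/step API.

```python
# Minimal sketch of collecting trajectories from a Gym continuous-control task.
# "Pendulum-v0" is a placeholder environment, not necessarily one used in the paper.
import gym

env = gym.make("Pendulum-v0")
trajectory = []
obs = env.reset()                        # classic (pre-0.26) Gym API
for _ in range(200):
    action = env.action_space.sample()   # stand-in for the observed learner's policy
    next_obs, reward, done, info = env.step(action)
    trajectory.append((obs, action, next_obs))
    obs = env.reset() if done else next_obs
env.close()
```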
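
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. In the minimal sketch below, only the numeric values come from the paper; the field names and the choice of a Python dataclass are illustrative assumptions.

```python
# Hyperparameters quoted in the paper, collected in one place.
# Field names and structure are illustrative; only the values are from the paper.
from dataclasses import dataclass

@dataclass
class LfLConfig:
    gamma: float = 0.96          # discount factor
    alpha: float = 0.3           # trade-off factor
    alpha_model: float = 0.7     # alpha used when modelling SPI
    learning_rate: float = 1e-3  # Adam learning rate (Kingma & Ba, 2014)
    policy_regression_steps: int = 1000
    kl_regression_steps: int = 100
    reward_init_steps: int = 3000
    reward_consistency_steps: int = 1000

config = LfLConfig()
```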