Learning from a Learner
Authors: Alexis Jacq, Matthieu Geist, Ana Paiva, Olivier Pietquin
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the genericity of our method by observing agents implementing various reinforcement learning algorithms. Finally, we show that, on both discrete and continuous state/action tasks, the observer's performance (that optimizes the recovered reward) can surpass those of the observed learner. |
| Researcher Affiliation | Collaboration | Google Brain, Paris, France; INESC-ID, IST, University of Lisbon. |
| Pseudocode | Yes | Algorithm 1 Recovering trajectory-consistent reward |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about the release of its source code. |
| Open Datasets | Yes | To evaluate how our approach holds when dealing with large dimensions, we use the same experimental setting on continuous control tasks taken from the Open AI gym benchmark suite (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes training procedures and parameters, but it does not specify explicit training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions software components and algorithms like 'Open AI gym', 'PPO', and 'Adam gradient descent' but does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | We use a discount factor γ = 0.96 and a trade-off factor α = 0.3. ... we use Adam gradient descent (Kingma & Ba, 2014) with learning rate 1e-3. ... The algorithm is run by modelling SPI with α_model = 0.7. We use 1000 steps for the policy regressions, 100 steps for the KL divergence regressions, 3000 steps for the reward initialization and 1000 steps for the reward consistency regression. |
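
For context on the Open Datasets row: the continuous control tasks are taken from the OpenAI Gym benchmark suite (Brockman et al., 2016). The minimal sketch below shows how such environments might be instantiated; the specific environment names are illustrative assumptions (the quoted excerpt does not list them), and the classic pre-0.26 `gym` API is assumed.

```python
# Minimal sketch: loading OpenAI Gym tasks like those referenced in the
# "Open Datasets" row. Environment names are illustrative assumptions and the
# classic (pre-0.26) gym reset/step API is assumed.
import gym

for env_id in ["CartPole-v1", "Pendulum-v0"]:  # hypothetical task choices
    env = gym.make(env_id)
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()       # placeholder random policy
        obs, reward, done, info = env.step(action)
    env.close()
```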
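
Similarly, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The sketch below is not the authors' code: only the numeric values come from the table, while the use of PyTorch and the reward-model architecture are assumptions made for illustration.

```python
# Sketch gathering the quoted hyperparameters in one place; only the numbers
# come from the Experiment Setup row. The reward network is a hypothetical stand-in.
import torch
import torch.nn as nn

GAMMA = 0.96                     # discount factor γ
ALPHA = 0.3                      # trade-off factor α
ALPHA_MODEL = 0.7                # coefficient used when modelling SPI
POLICY_REG_STEPS = 1000          # steps for the policy regressions
KL_REG_STEPS = 100               # steps for the KL divergence regressions
REWARD_INIT_STEPS = 3000         # steps for the reward initialization
REWARD_CONSISTENCY_STEPS = 1000  # steps for the reward consistency regression

# Hypothetical reward model (input size 4 chosen arbitrarily for illustration).
reward_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# Adam gradient descent (Kingma & Ba, 2014) with the quoted learning rate 1e-3.
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
```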