Programming by Feedback
Authors: Marc Schoenauer, Riad Akrour, Michele Sebag, Jean-Christophe Souplet
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A proof of principle of the approach is proposed, showing that PF requires a handful of interactions in order to solve some discrete and continuous benchmark problems. [...] Section 4 provides a proof-of-concept of the approach, showing that PF requires a handful of interactions to solve state-of-art benchmark problems in simulation, and to achieve on-board programming of the Nao robot (Aldebaran, 2013). [...] 4. Experimental results |
| Researcher Affiliation | Academia | Riad Akrour RIAD.AKROUR@LRI.FR Marc Schoenauer MARC.SCHOENAUER@INRIA.FR Jean-Christophe Souplet JCSOUPLET@LRI.FR Michele Sebag MICHELE.SEBAG@INRIA.FR TAO, INRIA/CNRS/LRI, Université Paris-Sud, 91405 France |
| Pseudocode | Yes | Algorithm 1 Programming by Feedback |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | The paper describes generating data for the gridworld and Nao robot problems (e.g., 'The transition model involves a 50% probability of staying motionless... It is estimated from 1,000 random triplets.' for gridworld), and uses simulators for cartpole and bicycle, rather than referencing a publicly available dataset with concrete access information (link, DOI, specific citation for dataset). While using benchmark problems, it does not provide access to the specific data used in their experiments. |
| Dataset Splits | No | The paper is focused on reinforcement learning tasks and does not describe standard training, validation, or test dataset splits in the context of static datasets. It refers to 'PF interactions' and 'runs' for evaluation, which is a different paradigm. |
| Hardware Specification | Yes | The computational time is less than 1 minute per run on a 2.4Ghz Intel processor for all problems except the Nao problem (10 mns). |
| Software Dependencies | No | The paper mentions software like LSPI and CMA-ES but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | The transition model involves a 50% probability of staying motionless (100% if the selected action would send the agent in the wall). It is estimated from 1,000 random triplets. The reward function (true utility w*) is shown in Fig. 2.(a). The core optimization component (Section 3.4) implements a vanilla policy iteration algorithm, with γ = .95. Time horizon is set to H = 300. Results are averaged over 21 runs. [...] The user's feedback is emulated using hyperparameter ME (the higher ME, the less competent the user); MA is the hyper-parameter of the user's noise model estimated by the active computer (the higher MA, the more the active computer underestimates the user's competence), with ME and MA ranging in {1, .5, .25} s.t. MA ≥ ME. [...] The demonstration length is 3,000. [...] The maximum demonstration length is 30,000 time steps. |
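
The gridworld transition model quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code, which the paper does not release. The grid size, the treatment of the grid boundary as the only wall, the reading of "triplets" as (state, action, next state) samples, and the names `step` and `estimate_transitions` are all assumptions made for illustration.

```python
import random

# Hypothetical layout: the quoted setup gives neither grid size nor wall positions,
# so this sketch treats the boundary of a small square grid as the wall.
GRID_SIZE = 5
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STAY_PROBABILITY = 0.5  # from the setup: 50% probability of staying motionless


def step(state, action, rng):
    """Sample the next cell for one move under the described transition model."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    target = (row + d_row, col + d_col)
    hits_wall = not (0 <= target[0] < GRID_SIZE and 0 <= target[1] < GRID_SIZE)
    # A move into the wall always leaves the agent in place (100% motionless).
    if hits_wall or rng.random() < STAY_PROBABILITY:
        return state
    return target


def estimate_transitions(num_samples, rng):
    """Estimate P(s' | s, a) by counting random (s, a, s') triplets (assumed reading)."""
    counts = {}
    for _ in range(num_samples):
        state = (rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE))
        action = rng.choice(list(ACTIONS))
        next_state = step(state, action, rng)
        bucket = counts.setdefault((state, action), {})
        bucket[next_state] = bucket.get(next_state, 0) + 1
    # Normalize counts into empirical probabilities per (state, action) pair.
    return {
        key: {s2: n / sum(bucket.values()) for s2, n in bucket.items()}
        for key, bucket in counts.items()
    }


# 1,000 samples, matching the number of random triplets quoted in the setup.
model = estimate_transitions(1000, random.Random(0))
```

With only 1,000 samples, each (state, action) pair is visited a handful of times, so the estimated probabilities are noisy; the resulting `model` dictionary would be what a policy iteration routine with γ = .95 consumes in a setup like the one described.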