Programming by Feedback

Authors: Marc Schoenauer, Riad Akrour, Michele Sebag, Jean-Christophe Souplet

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A proof of principle of the approach is proposed, showing that PF requires a handful of interactions in order to solve some discrete and continuous benchmark problems. [...] Section 4 provides a proof-of-concept of the approach, showing that PF requires a handful of interactions to solve state-of-art benchmark problems in simulation, and to achieve on-board programming of the Nao robot (Aldebaran, 2013). [...] 4. Experimental results
Researcher Affiliation | Academia | Riad Akrour (RIAD.AKROUR@LRI.FR), Marc Schoenauer (MARC.SCHOENAUER@INRIA.FR), Jean-Christophe Souplet (JCSOUPLET@LRI.FR), Michele Sebag (MICHELE.SEBAG@INRIA.FR); TAO, INRIA/CNRS/LRI, Université Paris-Sud, 91405 France
Pseudocode | Yes | Algorithm 1 Programming by Feedback
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository.
Open Datasets | No | The paper describes generating data for the gridworld and Nao robot problems (e.g., 'The transition model involves a 50% probability of staying motionless... It is estimated from 1,000 random triplets.' for gridworld), and uses simulators for cartpole and bicycle, rather than referencing a publicly available dataset with concrete access information (link, DOI, specific citation for dataset). While using benchmark problems, it does not provide access to the specific data used in their experiments.
Dataset Splits | No | The paper is focused on reinforcement learning tasks and does not describe standard training, validation, or test dataset splits in the context of static datasets. It refers to 'PF interactions' and 'runs' for evaluation, which is a different paradigm.
Hardware Specification | Yes | The computational time is less than 1 minute per run on a 2.4Ghz Intel processor for all problems except the Nao problem (10 mns).
Software Dependencies | No | The paper mentions software like LSPI and CMA-ES but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | The transition model involves a 50% probability of staying motionless (100% if the selected action would send the agent in the wall). It is estimated from 1,000 random triplets. The reward function (true utility w*) is shown in Fig. 2.(a). The core optimization component (Section 3.4) implements a vanilla policy iteration algorithm, with γ = .95. Time horizon is set to H = 300. Results are averaged over 21 runs. [...] The user's feedback is emulated using hyperparameter ME (the higher ME, the less competent the user); MA is the hyper-parameter of the user's noise model estimated by the active computer (the higher MA, the more the active computer underestimates the user's competence), with ME and MA ranging in {1, .5, .25} s.t. MA ≥ ME. [...] The demonstration length is 3,000. [...] The maximum demonstration length is 30,000 time steps
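
The gridworld setup quoted under Experiment Setup can be illustrated with a minimal Python sketch. It is not the authors' code (none was released). It keeps only the stated transition rule (50% chance of staying motionless, 100% when the action would hit a wall) and vanilla policy iteration with γ = .95. The 3x3 grid, the four-action set, the reward vector, and the function names `build_transitions` and `policy_iteration` are assumptions made for illustration. The sketch also uses the exact transition model and an infinite-horizon discounted return, whereas the paper estimates the model from 1,000 random triplets and sets H = 300.

```python
import numpy as np

GAMMA = 0.95      # discount factor quoted in the Experiment Setup row
STAY_PROB = 0.5   # probability of staying motionless, from the same row
# Assumed action set (not specified in the summary): up, down, left, right.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def build_transitions(n_rows, n_cols):
    """Return P of shape (n_actions, n_states, n_states) for the grid."""
    n_states = n_rows * n_cols
    P = np.zeros((len(ACTIONS), n_states, n_states))
    for r in range(n_rows):
        for c in range(n_cols):
            s = r * n_cols + c
            for a, (dr, dc) in enumerate(ACTIONS):
                nr, nc = r + dr, c + dc
                if 0 <= nr < n_rows and 0 <= nc < n_cols:
                    # Legal move: stay with prob. 0.5, move otherwise.
                    P[a, s, s] = STAY_PROB
                    P[a, s, nr * n_cols + nc] = 1.0 - STAY_PROB
                else:
                    # Move into a wall: the agent stays with prob. 1.
                    P[a, s, s] = 1.0
    return P


def policy_iteration(P, reward, gamma=GAMMA, tol=1e-10):
    """Vanilla policy iteration for a state-based reward vector."""
    n_actions, n_states, _ = P.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = reward.
        P_pi = P[policy, states]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, reward)
        # Policy improvement: switch only where an action is strictly better,
        # so that ties and round-off cannot make the loop oscillate.
        Q = reward[None, :] + gamma * (P @ V)   # shape (n_actions, n_states)
        improve = Q.max(axis=0) > Q[policy, states] + tol
        if not improve.any():
            return policy, V
        policy = np.where(improve, Q.argmax(axis=0), policy)


# Hypothetical 3x3 example: reward 1 in the bottom-right cell, 0 elsewhere.
P = build_transitions(3, 3)
reward = np.zeros(9)
reward[8] = 1.0
policy, V = policy_iteration(P, reward)
```

In this toy example the agent can remain in the rewarding corner by pushing against a wall, so the value of that cell is 1 / (1 - γ) = 20. The actual reward function used in the paper is the one shown in its Fig. 2(a), which this sketch does not reproduce.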