Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalizing Skills with Semi-Supervised Reinforcement Learning

Authors: Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, Sergey Levine

ICLR 2017

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experimental Evaluation): "We report the success rate of policies learned with each method in Table 1, and visualize the generalization performance in the 2-link reacher, cheetah, and obstacle tasks in Figure 3." |
| Researcher Affiliation | Collaboration | Berkeley AI Research (BAIR), University of California, Berkeley; OpenAI |
| Pseudocode | Yes | "Algorithm 1: Semi-Supervised Skill Generalization" |
| Open Source Code | Yes | "Code for reproducing the simulated experiments is available online." The code is available at github.com/cbfinn/gps/tree/ssrl |
| Open Datasets | No | "Thus, we define our own set of simulated control tasks for this paper, explicitly considering the types of variation that a robot might encounter in the real world." |
| Dataset Splits | No | The paper discusses 'labeled MDPs' and 'unlabeled MDPs' but does not provide explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for running experiments are mentioned. |
| Software Dependencies | No | The paper mentions the MuJoCo simulator and methods such as mirror descent guided policy search (MDGPS), but does not provide specific software names with version numbers for reproducibility. |
| Experiment Setup | Yes | "For the non-visual tasks, the policy was represented using a neural network with 2 hidden layers of 40 units each. The vision task used 3 convolutional layers with 15 filters of size 5×5 each, followed by the spatial feature point transformation proposed by Levine et al. (2016), and lastly 3 fully-connected layers of 20 units each. The reward function architecture mirrored that of the policy, but using a quadratic norm on the output, as done by Finn et al. (2016)." |
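The non-visual policy architecture quoted above (a feedforward network with 2 hidden layers of 40 units each) can be sketched as follows. This is an illustrative reconstruction in NumPy, not the authors' code: the state and action dimensions, the ReLU activation, and the weight initialization are placeholder assumptions, since the paper excerpt does not specify them.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    # Placeholder initialization: small Gaussian weights, zero biases.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def policy_forward(params, state):
    # Two hidden layers of 40 units (ReLU assumed), linear action output.
    h = state
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
state_dim, action_dim = 10, 2  # placeholder dimensions, not from the paper
sizes = [state_dim, 40, 40, action_dim]
params = [init_layer(rng, sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

action = policy_forward(params, np.zeros(state_dim))
```

The quoted reward architecture would reuse the same trunk but apply a quadratic norm to the output; that variant is omitted here since the excerpt gives no further detail.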