Policy Learning Using Weak Supervision

Authors: Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high."
Researcher Affiliation | Academia | University of Toronto, Vector Institute, Northwestern University, UC Santa Cruz
Pseudocode | Yes | Algorithm 1: Peer policy co-training (PeerCT); a hedged sketch of a peer-agreement update follows the table.
Open Source Code | Yes | Code is available online at https://github.com/wangjksjtu/PeerPL.
Open Datasets | No | The paper mentions evaluating on "control and Atari games", which are standard environments, and generating "100 trajectories for each environment" for BC, but it does not provide access information (link, citation, etc.) for a publicly released dataset of those trajectories, nor does it point to an established benchmark dataset release.
Dataset Splits | No | The paper does not provide dataset split information (percentages, counts, or citations) for training, validation, or test sets.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU/CPU models, cloud instances) used to run its experiments.
Software Dependencies | No | The paper mentions various algorithms (DQN, DDQN, DDPG, PPO) but does not provide version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For CartPole-v0, it states that models are trained for "10,000 steps" and experiments are repeated "10 times with different random seeds". It mentions using "DDPG [57] with uniform noise" for Pendulum and discusses the "CA coefficient" and the step size β for the policy update in Algorithm 1, providing concrete setup details.
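
The Pseudocode row above points to Algorithm 1 (PeerCT). As a reading aid only, here is a minimal, hypothetical sketch of a peer-agreement-style loss in the spirit of that algorithm, assuming a cross-entropy base loss between one policy's action logits and its peer's actions; the function name `peer_agreement_loss`, its arguments, and the exact form of the update are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def peer_agreement_loss(logits_a, actions_b, xi=0.5):
    """Sketch of a peer-style co-training loss (not the paper's exact PeerCT update).

    logits_a:  action logits of policy A on a batch of states
    actions_b: actions chosen by peer policy B on the *same* states
    xi:        agreement coefficient (stands in for the CA coefficient)
    """
    # Co-training term: imitate the peer on matched state/action pairs.
    matched = F.cross_entropy(logits_a, actions_b)

    # Peer term: the same loss on randomly re-paired (state, peer-action)
    # samples; subtracting it discourages blind agreement with the peer.
    perm = torch.randperm(actions_b.size(0))
    mismatched = F.cross_entropy(logits_a, actions_b[perm])

    return matched - xi * mismatched
```

Subtracting the mismatched term is the general peer-loss-style correction; how the paper actually weights and schedules it (the CA coefficient and the step size β) is specified in Algorithm 1 itself.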
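
The Experiment Setup row reports CartPole-v0 runs of 10,000 steps repeated over 10 random seeds. The following driver is a sketch of that protocol only, assuming the pre-0.26 Gym API; `train_agent` is a hypothetical callable standing in for the actual training loop (e.g., DQN or PeerCT) and does not come from the paper or its repository.

```python
import gym
import numpy as np

def run_cartpole_experiments(train_agent, total_steps=10_000, n_seeds=10):
    """Mirror the reported protocol: CartPole-v0, 10,000 training steps,
    repeated with 10 different random seeds."""
    returns = []
    for seed in range(n_seeds):
        env = gym.make("CartPole-v0")
        env.seed(seed)              # pre-0.26 Gym seeding API (assumption)
        np.random.seed(seed)
        returns.append(train_agent(env, total_steps=total_steps, seed=seed))
        env.close()
    # Report mean and spread across seeds, as is standard for RL benchmarks.
    return float(np.mean(returns)), float(np.std(returns))
```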