Policy Learning Using Weak Supervision
Authors: Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high." |
| Researcher Affiliation | Academia | University of Toronto, Vector Institute, Northwestern University, UC Santa Cruz |
| Pseudocode | Yes | Algorithm 1: Peer policy co-training (PeerCT) |
| Open Source Code | Yes | Code is available online at: https://github.com/wangjksjtu/PeerPL |
| Open Datasets | No | The paper mentions evaluating on "control and Atari games" (standard environments) and, for BC, generating "100 trajectories for each environment", but it does not provide access information (link, citation, etc.) for a publicly released dataset of these trajectories, nor does it reference established benchmark datasets as specific files. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, counts, or citations) for training, validation, or test sets. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, cloud instances) used for running its experiments. |
| Software Dependencies | No | The paper mentions various algorithms (DQN, DDQN, DDPG, PPO) but does not provide specific version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For CartPole-v0, the paper states that models were trained for "10,000 steps" and experiments were repeated "10 times with different random seeds". It mentions using "DDPG [57] with uniform noise" for Pendulum, and Algorithm 1 specifies the CA coefficient and the step size β for the policy update, providing concrete setup details. |