Policy Learning Using Weak Supervision

Authors: Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high."
Researcher Affiliation | Academia | University of Toronto, Vector Institute, Northwestern University, UC Santa Cruz
Pseudocode | Yes | Algorithm 1: Peer policy co-training (PeerCT); a hedged sketch of a peer-agreement update follows the table.
Open Source Code | Yes | Code is available online at https://github.com/wangjksjtu/PeerPL.
Open Datasets | No | The paper mentions evaluating on "control and Atari games", which are standard environments, and generating "100 trajectories for each environment" for BC, but it does not provide access information (link, citation, etc.) for a publicly released dataset of those trajectories, nor does it point to an established benchmark dataset release.
Dataset Splits | No | The paper does not provide dataset split information (percentages, counts, or citations) for training, validation, or test sets.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU/CPU models, cloud instances) used to run its experiments.
Software Dependencies | No | The paper mentions various algorithms (DQN, DDQN, DDPG, PPO) but does not provide version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For CartPole-v0, it states that models are trained for "10,000 steps" and experiments are repeated "10 times with different random seeds". It mentions using "DDPG [57] with uniform noise" for Pendulum and discusses the "CA coefficient" and the step size β for the policy update in Algorithm 1, providing concrete setup details.
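
The Pseudocode row above points to Algorithm 1 (PeerCT). As a reading aid only, here is a minimal, hypothetical sketch of a peer-agreement-style loss in the spirit of that algorithm, assuming a cross-entropy base loss between one policy's action logits and its peer's actions; the function name `peer_agreement_loss`, its arguments, and the exact form of the update are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def peer_agreement_loss(logits_a, actions_b, xi=0.5):
    """Sketch of a peer-style co-training loss (not the paper's exact PeerCT update).

    logits_a:  action logits of policy A on a batch of states
    actions_b: actions chosen by peer policy B on the *same* states
    xi:        agreement coefficient (stands in for the CA coefficient)
    """
    # Co-training term: imitate the peer on matched state/action pairs.
    matched = F.cross_entropy(logits_a, actions_b)

    # Peer term: the same loss on randomly re-paired (state, peer-action)
    # samples; subtracting it discourages blind agreement with the peer.
    perm = torch.randperm(actions_b.size(0))
    mismatched = F.cross_entropy(logits_a, actions_b[perm])

    return matched - xi * mismatched
```

Subtracting the mismatched term is the general peer-loss-style correction; how the paper actually weights and schedules it (the CA coefficient and the step size β) is specified in Algorithm 1 itself.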
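
The Experiment Setup row reports CartPole-v0 runs of 10,000 steps repeated over 10 random seeds. The following driver is a sketch of that protocol only, assuming the pre-0.26 Gym API; `train_agent` is a hypothetical callable standing in for the actual training loop (e.g., DQN or PeerCT) and does not come from the paper or its repository.

```python
import gym
import numpy as np

def run_cartpole_experiments(train_agent, total_steps=10_000, n_seeds=10):
    """Mirror the reported protocol: CartPole-v0, 10,000 training steps,
    repeated with 10 different random seeds."""
    returns = []
    for seed in range(n_seeds):
        env = gym.make("CartPole-v0")
        env.seed(seed)              # pre-0.26 Gym seeding API (assumption)
        np.random.seed(seed)
        returns.append(train_agent(env, total_steps=total_steps, seed=seed))
        env.close()
    # Report mean and spread across seeds, as is standard for RL benchmarks.
    return float(np.mean(returns)), float(np.std(returns))
```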