Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training
Authors: Wenjun Li, Pradeep Varakantham
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically validate the effectiveness of the UTSD framework and demonstrate the transferability of the meta-teacher by comparing it to a set of leading baselines in UED: Domain Randomization (DR), PAIRED, PLR, and ACCEL. We conduct experiments on three popular yet distinct benchmarks in UED: Bit-Flipping, Lunar-Lander, and Minigrid. |
| Researcher Affiliation | Academia | Wenjun Li, Pradeep Varakantham Singapore Management University wjli.2020@phdcs.smu.edu.sg, pradeepv@smu.edu.sg |
| Pseudocode | Yes | Algorithm 1: Train meta-teacher |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | The Bit-Flipping environment, introduced by (Andrychowicz et al. 2017), is widely used in RL for its efficiency. |
| Dataset Splits | Yes | Specifically, our approach makes two key contributions: 1. A scalable agent policy encoding method, which can help the teacher in UTSD closely track the student's overall ability and behaviors and consequently design efficient training sequences with finite length. 2. Train a generalizable teacher that can rapidly adapt to unseen students with various learning patterns and capabilities by employing the context-based meta-RL approach. Student Policy Encoding: In this section, we elaborate on how to collect a set of diverse environments regarding the student agent policy behaviors with the Quality Diversity method. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions various algorithms like DQN (Mnih et al. 2013), PPO (Schulman et al. 2017), SAC (Haarnoja et al. 2018), and PEARL (Rakelly et al. 2019) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In our experiments, the maximum training sequence length is set to 12 and the training amount on each task is fixed at 5k steps. |
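
The experiment-setup cell above pins down two quantities: a training sequence of at most 12 tasks and a fixed budget of 5k training steps per task. The Python sketch below illustrates how such a teacher-student loop could be wired up under those settings. It is a minimal sketch only: the class and method names (`TeacherPolicy`, `StudentAgent`, `propose_task`, `policy_encoding`) are hypothetical placeholders, since the paper does not release code, and the bodies are stubs rather than the authors' actual QD-based encoding or meta-RL teacher update.

```python
# Hypothetical sketch of the UTSD-style teacher-student loop described in the
# paper's setup: at most 12 tasks per sequence, 5k training steps per task.
# All names and bodies here are illustrative placeholders, not the authors' code.

import random

MAX_SEQUENCE_LENGTH = 12   # maximum number of tasks in one training sequence
STEPS_PER_TASK = 5_000     # fixed training budget on each designed task


class TeacherPolicy:
    """Hypothetical meta-teacher that maps a student-policy encoding to a task."""

    def propose_task(self, student_encoding):
        # Placeholder: a real teacher would condition on the encoding (and, in
        # the meta-RL variant, on a latent context inferred from the student).
        return {"difficulty": random.random()}

    def update(self, student_encoding, task, student_return):
        # Placeholder for the teacher's RL update, rewarded by student progress.
        pass


class StudentAgent:
    """Hypothetical student RL agent with a fixed per-task training budget."""

    def train_on(self, task, num_steps):
        # Placeholder: run an RL algorithm (e.g. DQN or PPO) on the task for
        # `num_steps` environment steps and return an evaluation score.
        return random.random() + task["difficulty"]

    def policy_encoding(self):
        # Placeholder for the scalable policy-encoding step, e.g. the student's
        # behaviour on a set of diverse probe environments collected with a
        # Quality Diversity method.
        return [random.random() for _ in range(8)]


def run_training_sequence(teacher, student):
    """One teacher episode: design and train on up to MAX_SEQUENCE_LENGTH tasks."""
    for step in range(MAX_SEQUENCE_LENGTH):
        encoding = student.policy_encoding()
        task = teacher.propose_task(encoding)
        score = student.train_on(task, STEPS_PER_TASK)
        teacher.update(encoding, task, score)
        print(f"task {step + 1:2d}: difficulty={task['difficulty']:.2f}, score={score:.2f}")


if __name__ == "__main__":
    run_training_sequence(TeacherPolicy(), StudentAgent())
```

The sketch only fixes the sequence-level structure (finite sequence length, fixed per-task budget, teacher conditioned on a student encoding); the interesting parts of the method, the QD-based encoding and the context-based meta-RL teacher, would replace the stubbed methods.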