Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting
Authors: Xiong-Hui Chen, Ziyan Wang, Yali Du, Shengyi Jiang, Meng Fang, Yang Yu, Jun Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiment, URI's policy achieves a minimum of 44% net winning rate against GPT-based agents without any real data. In the much more complex football game, URI's policy beats the built-in AIs with a 37% winning rate, while GPT-based agents achieve only a 6% winning rate. |
| Researcher Affiliation | Academia | Xiong-Hui Chen (1), Ziyan Wang (2), Yali Du (2), Shengyi Jiang (5), Meng Fang (4), Yang Yu (1), Jun Wang (3). 1: National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; 2: Cooperative AI Lab, Department of Informatics, King's College London; 3: AI Centre, Department of Computer Science, University College London; 4: University of Liverpool; 5: The University of Hong Kong |
| Pseudocode | Yes | Algorithm 4 URI (Understanding, Rehearsing, and Introspecting) |
| Open Source Code | No | Project page: plfb-football.github.io. We commit to open-sourcing the code that reproduces all experimental results after the paper is published. This paper does not release new assets. |
| Open Datasets | Yes | For football, we collect the textbook dataset from the open-source book dataset RedPajama-1T [57], focusing on titles and abstracts related to football or soccer. After filtering, we obtain a curated set of ninety books closely aligned with the domain (a sketch of this kind of keyword filter appears below the table). [17] Together Computer. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data. |
| Dataset Splits | No | No explicit training/validation/test dataset splits with percentages or sample counts were found. The paper mentions sampling initial states and imagining transitions for policy distillation but not a separate validation set. |
| Hardware Specification | Yes | All experiments were conducted on a high-performance computing (HPC) system featuring 128 Intel Xeon processors running at 2.2 GHz, 5 TB of memory, an Nvidia A100 PCIE-40G GPU, and two Nvidia A30 GPUs. |
| Software Dependencies | No | We implement CIQL based on the open-source code of CQL in d3rlpy [62]... Since this step requires a strong understanding of the code, we use GPT-4 instead of GPT-3.5 as the LLM implementation. No specific versions for d3rlpy or other general software dependencies were provided (a hedged d3rlpy usage sketch appears below the table). |
| Experiment Setup | Yes | Table 5 (URI hyperparameters): knowledge pieces per aggregation step N_agg = 4; retrieved segments N_ret = 15; learning rate λ = 0.0001; transition weight η_T = 0.5; reward-penalty weight η_R = 0.5; rollout horizon H = 10; conservative-loss weight α = 60; ensemble size N_ens = 20. These values are collected into a config sketch below the table. |
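
The Open Datasets row describes curating football/soccer books from RedPajama-1T by keyword-matching titles and abstracts. Below is a minimal sketch of that kind of filter, assuming a JSONL metadata dump with `title` and `abstract` fields; the file name, field names, and keyword list are our assumptions, not details from the paper's pipeline.

```python
import json

# Keywords used to match domain-relevant books (assumed, not from the paper).
KEYWORDS = ("football", "soccer")

def is_domain_book(record: dict) -> bool:
    """Return True if a book's title or abstract mentions the target sport."""
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    return any(kw in text for kw in KEYWORDS)

def filter_books(path: str) -> list[dict]:
    """Stream a JSONL dump of book metadata and keep domain-relevant entries."""
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if is_domain_book(record):
                kept.append(record)
    return kept

if __name__ == "__main__":
    # "redpajama_books_metadata.jsonl" is a hypothetical export of the
    # RedPajama-1T book subset's metadata.
    books = filter_books("redpajama_books_metadata.jsonl")
    print(f"Kept {len(books)} candidate books")  # the paper reports ninety after curation
```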
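
For the Software Dependencies row: the paper builds CIQL on d3rlpy's CQL but pins no version. The sketch below shows a plain CQL training run under the d3rlpy 2.x API with synthetic stand-in transitions; mapping Table 5's α = 60 onto `conservative_weight` is our assumption, and this is not the paper's CIQL implementation.

```python
import numpy as np
import d3rlpy

# Synthetic stand-in transitions; a real run would load offline data instead.
observations = np.random.randn(1000, 8).astype(np.float32)
actions = np.random.randn(1000, 2).astype(np.float32)
rewards = np.random.randn(1000).astype(np.float32)
terminals = np.zeros(1000, dtype=np.float32)
terminals[99::100] = 1.0  # mark an episode boundary every 100 steps

dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals)

# conservative_weight = 60.0 mirrors Table 5's conservative-loss weight
# (an assumed mapping onto d3rlpy's knob).
cql = d3rlpy.algos.CQLConfig(conservative_weight=60.0).create(device="cuda:0")
cql.fit(dataset, n_steps=100_000, n_steps_per_epoch=1_000)
```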
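
Finally, the Table 5 values from the Experiment Setup row, gathered into one illustrative config object. The dataclass and its field names are ours, not identifiers from the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class URIHyperparams:
    """Table 5 values; field names are illustrative, not from the paper's code."""
    n_agg: int = 4               # knowledge pieces per aggregation step (N_agg)
    n_ret: int = 15              # retrieved segments (N_ret)
    learning_rate: float = 1e-4  # learning rate (lambda)
    eta_transition: float = 0.5  # weight of transition (eta_T)
    eta_reward: float = 0.5      # weight of reward penalties (eta_R)
    horizon: int = 10            # rollout horizon (H)
    alpha: float = 60.0          # weight of conservative loss
    n_ensemble: int = 20         # number of ensemble members (N_ens)

DEFAULTS = URIHyperparams()
```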