Towards Long-delayed Sparsity: Learning a Better Transformer through Reward Redistribution
Authors: Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper states: 'We extensively evaluate the proposed method on various benchmarks and demonstrate an overwhelming performance improvement under long-delayed settings.' Section 4 (Experiments) adds: 'This section assesses the effectiveness of our approach across various offline RL benchmarks, highlighting the benefits of utilizing redistributed rewards in long-delayed settings.' |
| Researcher Affiliation | Academia | 1Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University 2School of Computer Science and Engineering, Beihang University 3School of Software, Beihang University 4Shenyuan Honors College, Beihang University {zhutc,qiuyue,zhouhy,lijx}@act.buaa.edu.cn |
| Pseudocode | Yes | Algorithm 1: Bi-level Optimization of Reward Redistribution |
| Open Source Code | Yes | The source code is available at https://github.com/catezi/DTRD. |
| Open Datasets | Yes | We evaluated our method on both discrete and continuous control tasks. The discrete control tasks, including Atari [Bellemare et al., 2015] and Minigrid [Chevalier-Boisvert et al., 2018], involve high-dimensional observation spaces and require long-term reward redistribution. On the other hand, the continuous control tasks, such as Open AI Gym Mujoco [Brockman et al., 2016], Maze2d [Fu et al., 2020], and Franka Kitchen [Fu et al., 2020], not only have extremely delayed rewards but also require fine-grained continuous control. |
| Dataset Splits | No | The paper states: 'Based on this, we divided all the trajectory data S into two categories: training set Strain and validation set Sval.' but gives no percentages, sample counts, or other instructions for reproducing this split (a hedged split sketch follows this table). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9'). |
| Experiment Setup | Yes | The paper states: 'The context length during the evaluation can be shorter than the context length used for training.' (Section 2.2) and 'where λ is a hyper-parameter to control the numerical scale balance' (Section 3.3). Appendix C.3 (Implementation Details) further specifies: 'We used an AdamW optimizer with a learning rate of 6e-4 and a weight decay of 1e-4. The context length is 20 for all environments.' A hedged configuration sketch follows this table. |
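
For the Dataset Splits row: the paper names a training set Strain and a validation set Sval but reports no ratio, so any split is unreproducible as published. The sketch below illustrates a trajectory-level split; the 90/10 ratio, the `split_trajectories` helper, and the fixed seed are assumptions, not the authors' code.

```python
import random

def split_trajectories(trajectories, val_fraction=0.1, seed=0):
    """Split a list of trajectories into S_train and S_val.

    The validation fraction and seed are illustrative assumptions;
    the paper does not specify them.
    """
    rng = random.Random(seed)
    indices = list(range(len(trajectories)))
    rng.shuffle(indices)
    n_val = max(1, int(len(indices) * val_fraction))
    val_idx = set(indices[:n_val])
    s_train = [t for i, t in enumerate(trajectories) if i not in val_idx]
    s_val = [t for i, t in enumerate(trajectories) if i in val_idx]
    return s_train, s_val
```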
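For the Experiment Setup row: a minimal sketch of the reported optimizer configuration (AdamW, learning rate 6e-4, weight decay 1e-4, context length 20), assuming PyTorch. The stand-in module is illustrative only and is not the paper's DTRD model.

```python
import torch

CONTEXT_LENGTH = 20  # "The context length is 20 for all environments." (Appendix C.3)

# Stand-in for the paper's Transformer policy; the architecture is an assumption.
model = torch.nn.TransformerEncoderLayer(d_model=128, nhead=8)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,            # learning rate reported in Appendix C.3
    weight_decay=1e-4,  # weight decay reported in Appendix C.3
)
```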