Towards Long-delayed Sparsity: Learning a Better Transformer through Reward Redistribution

Authors: Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We extensively evaluate the proposed method on various benchmarks and demonstrate an overwhelming performance improvement under long-delayed settings.' and, from Section 4 (Experiments): 'This section assesses the effectiveness of our approach across various offline RL benchmarks, highlighting the benefits of utilizing redistributed rewards in long-delayed settings.'
Researcher Affiliation | Academia | (1) Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University; (2) School of Computer Science and Engineering, Beihang University; (3) School of Software, Beihang University; (4) Shenyuan Honors College, Beihang University. Contact: {zhutc,qiuyue,zhouhy,lijx}@act.buaa.edu.cn
Pseudocode | Yes | Algorithm 1: Bi-level Optimization of Reward Redistribution. (A generic bi-level optimization sketch appears after this table.)
Open Source Code | Yes | The source code is available at https://github.com/catezi/DTRD.
Open Datasets | Yes | 'We evaluated our method on both discrete and continuous control tasks. The discrete control tasks, including Atari [Bellemare et al., 2015] and Minigrid [Chevalier-Boisvert et al., 2018], involve high-dimensional observation spaces and require long-term reward redistribution. On the other hand, the continuous control tasks, such as Open AI Gym Mujoco [Brockman et al., 2016], Maze2d [Fu et al., 2020], and Franka Kitchen [Fu et al., 2020], not only have extremely delayed rewards but also require fine-grained continuous control.' (An illustrative dataset-loading snippet appears after this table.)
Dataset Splits | No | The paper states: 'Based on this, we divided all the trajectory data S into two categories: training set S_train and validation set S_val.' but does not provide specific percentages, sample counts, or explicit instructions for reproducing these splits.
Hardware Specification | No | The paper does not describe the hardware (e.g., GPU models, CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9').
Experiment Setup | Yes | The paper states: 'The context length during the evaluation can be shorter than the context length used for training.' (Section 2.2) and 'where λ is a hyper-parameter to control the numerical scale balance' (Section 3.3). Appendix C.3 (Implementation Details) further specifies: 'We used an AdamW optimizer with a learning rate of 6e-4 and a weight decay of 1e-4. The context length is 20 for all environments.' (A configuration sketch with these values appears after this table.)
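
The Pseudocode entry refers to the paper's Algorithm 1 (Bi-level Optimization of Reward Redistribution). The listing below is only a generic, first-order sketch of how such a bi-level loop is commonly structured, not a reproduction of Algorithm 1: the `RewardRedistributor` network, the `loss_fn` callable, the episode-return consistency term, and the λ weighting are illustrative assumptions.

```python
# Generic first-order bi-level sketch (NOT the paper's Algorithm 1).
# Inner step: fit the sequence model on training trajectories whose delayed
# rewards are replaced by the redistributor's per-step proxy rewards.
# Outer step: update the redistributor on held-out validation trajectories,
# with lam balancing a return-consistency term (an assumption).
import torch
import torch.nn as nn


class RewardRedistributor(nn.Module):
    """Maps per-step states to proxy rewards intended to sum to the episode return."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:  # (B, T, obs_dim)
        return self.net(states).squeeze(-1)                   # (B, T) proxy rewards


def bilevel_step(model, redistributor, model_opt, redis_opt,
                 train_batch, val_batch, loss_fn, lam: float = 1.0):
    # Inner step: train the model with redistributed (detached) rewards.
    r_train = redistributor(train_batch["states"]).detach()
    model_opt.zero_grad()
    loss_fn(model, train_batch, r_train).backward()
    model_opt.step()

    # Outer step: tune the redistributor against validation performance plus a
    # lam-weighted term that keeps proxy rewards summing to the episode return.
    r_val = redistributor(val_batch["states"])
    val_loss = loss_fn(model, val_batch, r_val)
    consistency = (r_val.sum(dim=-1) - val_batch["episode_return"]).pow(2).mean()
    redis_opt.zero_grad()
    (val_loss + lam * consistency).backward()
    redis_opt.step()
    return val_loss.item()
```

The lam term here mirrors the quoted statement that λ controls the numerical scale balance between the two objectives.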
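
For the continuous-control benchmarks listed under Open Datasets, Maze2d and Franka Kitchen come from the D4RL suite of Fu et al., 2020. A minimal loading sketch is shown below; it assumes the d4rl package is installed and uses 'maze2d-umaze-v1' purely as an example task name, since the paper's exact dataset versions are not quoted here.

```python
# Illustrative D4RL loading (assumes the d4rl package; the task name is an example).
import gym
import d4rl  # noqa: F401 -- importing registers Maze2d / Franka Kitchen tasks with gym

env = gym.make("maze2d-umaze-v1")
data = env.get_dataset()  # dict of numpy arrays

print(data["observations"].shape)  # (N, obs_dim)
print(data["actions"].shape)       # (N, act_dim)
print(data["rewards"].shape)       # (N,)
# 'terminals' (and 'timeouts', where present) mark episode boundaries; in a
# long-delayed setting these per-step rewards are collapsed into a single
# episodic return before redistribution.
```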
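
The Experiment Setup entry quotes concrete optimizer settings (AdamW, learning rate 6e-4, weight decay 1e-4) and a context length of 20. A minimal configuration sketch with those values, using a placeholder module in place of the actual sequence model, would be:

```python
# Optimizer settings quoted from Appendix C.3; the model is a stand-in placeholder.
import torch

context_length = 20                       # context length used for all environments
model = torch.nn.Linear(17, 6)            # placeholder for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                              # learning rate from Appendix C.3
    weight_decay=1e-4,                    # weight decay from Appendix C.3
)
```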