Towards Long-delayed Sparsity: Learning a Better Transformer through Reward Redistribution

Authors: Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We extensively evaluate the proposed method on various benchmarks and demonstrate an overwhelming performance improvement under long-delayed settings.' and, from Section 4 (Experiments): 'This section assesses the effectiveness of our approach across various offline RL benchmarks, highlighting the benefits of utilizing redistributed rewards in long-delayed settings.'
Researcher Affiliation | Academia | (1) Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University; (2) School of Computer Science and Engineering, Beihang University; (3) School of Software, Beihang University; (4) Shenyuan Honors College, Beihang University. Contact: {zhutc,qiuyue,zhouhy,lijx}@act.buaa.edu.cn
Pseudocode | Yes | Algorithm 1: Bi-level Optimization of Reward Redistribution. (A generic bi-level optimization sketch appears after this table.)
Open Source Code | Yes | The source code is available at https://github.com/catezi/DTRD.
Open Datasets | Yes | 'We evaluated our method on both discrete and continuous control tasks. The discrete control tasks, including Atari [Bellemare et al., 2015] and Minigrid [Chevalier-Boisvert et al., 2018], involve high-dimensional observation spaces and require long-term reward redistribution. On the other hand, the continuous control tasks, such as Open AI Gym Mujoco [Brockman et al., 2016], Maze2d [Fu et al., 2020], and Franka Kitchen [Fu et al., 2020], not only have extremely delayed rewards but also require fine-grained continuous control.' (An illustrative dataset-loading snippet appears after this table.)
Dataset Splits | No | The paper states: 'Based on this, we divided all the trajectory data S into two categories: training set S_train and validation set S_val.' but does not provide specific percentages, sample counts, or explicit instructions for reproducing these splits.
Hardware Specification | No | The paper does not describe the hardware (e.g., GPU models, CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9').
Experiment Setup | Yes | The paper states: 'The context length during the evaluation can be shorter than the context length used for training.' (Section 2.2) and 'where λ is a hyper-parameter to control the numerical scale balance' (Section 3.3). Appendix C.3 (Implementation Details) further specifies: 'We used an AdamW optimizer with a learning rate of 6e-4 and a weight decay of 1e-4. The context length is 20 for all environments.' (A configuration sketch with these values appears after this table.)
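
The Pseudocode entry refers to the paper's Algorithm 1 (Bi-level Optimization of Reward Redistribution). The listing below is only a generic, first-order sketch of how such a bi-level loop is commonly structured, not a reproduction of Algorithm 1: the `RewardRedistributor` network, the `loss_fn` callable, the episode-return consistency term, and the λ weighting are illustrative assumptions.

```python
# Generic first-order bi-level sketch (NOT the paper's Algorithm 1).
# Inner step: fit the sequence model on training trajectories whose delayed
# rewards are replaced by the redistributor's per-step proxy rewards.
# Outer step: update the redistributor on held-out validation trajectories,
# with lam balancing a return-consistency term (an assumption).
import torch
import torch.nn as nn


class RewardRedistributor(nn.Module):
    """Maps per-step states to proxy rewards intended to sum to the episode return."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:  # (B, T, obs_dim)
        return self.net(states).squeeze(-1)                   # (B, T) proxy rewards


def bilevel_step(model, redistributor, model_opt, redis_opt,
                 train_batch, val_batch, loss_fn, lam: float = 1.0):
    # Inner step: train the model with redistributed (detached) rewards.
    r_train = redistributor(train_batch["states"]).detach()
    model_opt.zero_grad()
    loss_fn(model, train_batch, r_train).backward()
    model_opt.step()

    # Outer step: tune the redistributor against validation performance plus a
    # lam-weighted term that keeps proxy rewards summing to the episode return.
    r_val = redistributor(val_batch["states"])
    val_loss = loss_fn(model, val_batch, r_val)
    consistency = (r_val.sum(dim=-1) - val_batch["episode_return"]).pow(2).mean()
    redis_opt.zero_grad()
    (val_loss + lam * consistency).backward()
    redis_opt.step()
    return val_loss.item()
```

The lam term here mirrors the quoted statement that λ controls the numerical scale balance between the two objectives.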
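
For the continuous-control benchmarks listed under Open Datasets, Maze2d and Franka Kitchen come from the D4RL suite of Fu et al., 2020. A minimal loading sketch is shown below; it assumes the d4rl package is installed and uses 'maze2d-umaze-v1' purely as an example task name, since the paper's exact dataset versions are not quoted here.

```python
# Illustrative D4RL loading (assumes the d4rl package; the task name is an example).
import gym
import d4rl  # noqa: F401 -- importing registers Maze2d / Franka Kitchen tasks with gym

env = gym.make("maze2d-umaze-v1")
data = env.get_dataset()  # dict of numpy arrays

print(data["observations"].shape)  # (N, obs_dim)
print(data["actions"].shape)       # (N, act_dim)
print(data["rewards"].shape)       # (N,)
# 'terminals' (and 'timeouts', where present) mark episode boundaries; in a
# long-delayed setting these per-step rewards are collapsed into a single
# episodic return before redistribution.
```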
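
The Experiment Setup entry quotes concrete optimizer settings (AdamW, learning rate 6e-4, weight decay 1e-4) and a context length of 20. A minimal configuration sketch with those values, using a placeholder module in place of the actual sequence model, would be:

```python
# Optimizer settings quoted from Appendix C.3; the model is a stand-in placeholder.
import torch

context_length = 20                       # context length used for all environments
model = torch.nn.Linear(17, 6)            # placeholder for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                              # learning rate from Appendix C.3
    weight_decay=1e-4,                    # weight decay from Appendix C.3
)
```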