Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalized Decision Transformer for Offline Hindsight Information Matching

Authors: Hiroki Furuta, Yutaka Matsuo, Shixiang Shane Gu

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We experiment on the OpenAI Gym MuJoCo tasks (HalfCheetah, Hopper, Walker2d, Ant-v3), a common benchmark for continuous control (Brockman et al., 2016; Todorov et al., 2012). Through the experiments, we use medium-expert datasets in D4RL (Fu et al., 2020) to ensure decent data coverage. We sort all the trajectories by their cumulative rewards, hold out the five best trajectories and five 50th-percentile trajectories as a test set (10 trajectories in total), and use the rest as a train set. We report the results averaged over 20 rollouts across 4 random seeds.
Researcher Affiliation Collaboration Hiroki Furuta (The University of Tokyo), Yutaka Matsuo (The University of Tokyo), Shixiang Shane Gu (Google Research)
Pseudocode Yes See Algorithm 1 (in Appendix F) for the full pseudocode.
Open Source Code Yes We share our implementation to ensure the reproducibility: https://github.com/frt03/generalized_dt
Open Datasets Yes Through the experiments, we use medium-expert datasets in D4RL (Fu et al., 2020) to ensure decent data coverage.
Dataset Splits No The paper states: 'We sort all the trajectories by their cumulative rewards, hold out five best trajectories and five 50 percentile trajectories as a test set (10 trajectories in total), and use the rest as a train set.' This specifies a train/test split, but no explicit mention of a separate validation set for hyperparameter tuning or early stopping.
Hardware Specification No The paper does not provide specific details about the hardware used for running experiments, such as CPU or GPU models, or memory specifications.
Software Dependencies No The paper mentions 'pytorch implementation' in Appendix E.4 but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes We implement Categorical Decision Transformer and Bi-directional Decision Transformer, built upon the official codebase released by Chen et al. (2021a) (https://github.com/kzl/decision-transformer). We follow most of the hyperparameters as they did (Table 6).
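The train/test split described above (sort trajectories by cumulative reward, hold out the five best and five at the 50th percentile, train on the rest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_trajectories` and the exact indexing around the median are hypothetical, since the paper does not specify how the 50th-percentile trajectories are selected.

```python
import numpy as np

def split_trajectories(trajectories, returns, n_holdout=5):
    """Sketch of the paper's split: hold out the `n_holdout` best
    trajectories and `n_holdout` trajectories around the 50th
    percentile of cumulative reward as a test set; train on the rest.

    `trajectories` is any sequence of trajectory objects; `returns`
    is the matching sequence of cumulative rewards.
    """
    order = np.argsort(returns)                # indices, ascending by return
    best = list(order[-n_holdout:])            # the 5 highest-return trajectories
    # Take 5 trajectories centered on the median rank (an assumption;
    # the paper only says "five 50 percentile trajectories").
    mid_start = len(order) // 2 - n_holdout // 2
    median = list(order[mid_start:mid_start + n_holdout])
    test_idx = set(best + median)
    train = [trajectories[i] for i in range(len(trajectories)) if i not in test_idx]
    test = [trajectories[i] for i in sorted(test_idx)]
    return train, test
```

For a dataset of 20 trajectories this yields a 10/10 split, matching the "10 trajectories in total" held out in the paper; on the actual D4RL medium-expert datasets the test set is a tiny fraction of the data.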