Learning non-Markovian Decision-Making from State-only Sequences
Authors: Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, Sirui Xie
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of the proposed method in a prototypical path planning task with non-Markovian constraints and show that the learned model exhibits strong performances in challenging domains from the MuJoCo suite. (...) We also test the proposed modeling, learning, and computing method in MuJoCo, a domain with higher-dimensional state and action spaces, and achieve performance competitive to existing methods, even those that learn with action labels. |
| Researcher Affiliation | Academia | Aoyang Qin1,2, Feng Gao3, Qing Li2, Song-Chun Zhu1,2,4, Sirui Xie5. 1 Department of Automation, Tsinghua University; 2 Beijing Institute for General Artificial Intelligence (BIGAI); 3 Department of Statistics, UCLA; 4 School of Artificial Intelligence, Peking University; 5 Department of Computer Science, UCLA |
| Pseudocode | Yes | Algorithm 1: LanMDP without importance sampling (...) Algorithm 2: LanMDP with importance sampling |
| Open Source Code | Yes | Code and data are available at https://github.com/qayqaq/LanMDP |
| Open Datasets | No | The paper uses MuJoCo control tasks (Cartpole-v1, Reacher-v2, Swimmer-v3, Hopper-v2, Walker2d-v2) and generates its own demonstrations: 'We train an expert for each task using PPO [45]. They are then used to generate 10 trajectories for each task as demonstrations.' However, it does not provide access information (link, DOI, or citation with author/year) for these generated datasets. A minimal sketch of this rollout protocol follows the table. |
| Dataset Splits | No | The paper states 'Results for context length 1 are illustrated through learning curves and a bar plot in Fig. 4. These learning curves are the average progress across 5 seeds.' and 'We report the mean of the best performance achieved by BC/BCO with five random seeds'. This describes averaging over random seeds for robustness, not data splitting; explicit train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) are not provided for reproducibility. A sketch of the seed aggregation follows the table. |
| Hardware Specification | Yes | All benchmarking is performed using a single 3090Ti GPU and implemented using the PyTorch framework. |
| Software Dependencies | Yes | All benchmarking is performed using a single 3090Ti GPU and implemented using the PyTorch framework. (...) We leverage PPO [45] approach to train the expert policy... |
| Experiment Setup | Yes | Hyper-parameters are listed in Table 3. (...) The input and output dimensions are adapted to the state and action spaces in different tasks, and so are short-run sampling steps. Sequential contexts are extracted from stored episodic memory. The number of neurons in the input and hidden layer in the policy MLP varies according to the context length. We use replay buffers to store the self-interaction experiences for training the transition model offline. See Appendix D for detailed information on network architectures and hyper-parameters. A sketch of such a context-scaled policy MLP follows the table. |
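
The demonstration protocol quoted in the Open Datasets row (a PPO expert rolling out 10 state-only trajectories per task) can be sketched as below. This is a minimal illustration only: the Gymnasium API, the `Hopper-v4` environment id, and the `expert_policy` callable are assumptions, not details confirmed by the paper, which used a PPO expert whose training stack is described in its Appendix D.

```python
# Minimal sketch of generating state-only demonstrations from a trained
# expert. The expert's actions are executed but never stored, matching the
# state-only setting of the paper.
import gymnasium as gym
import numpy as np


def collect_state_only_demos(env_id, expert_policy, n_trajectories=10, seed=0):
    """Roll out an expert and keep only the state sequences (no action labels)."""
    env = gym.make(env_id)
    demos = []
    for i in range(n_trajectories):
        states = []
        obs, _ = env.reset(seed=seed + i)
        done = False
        while not done:
            states.append(obs)
            action = expert_policy(obs)          # e.g. a trained PPO policy
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        states.append(obs)                       # include the terminal state
        demos.append(np.asarray(states))
    env.close()
    return demos


if __name__ == "__main__":
    # A random policy stands in for a trained PPO expert in this sketch.
    env = gym.make("Hopper-v4")
    random_expert = lambda obs: env.action_space.sample()
    demos = collect_state_only_demos("Hopper-v4", random_expert)
    print(len(demos), demos[0].shape)
```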
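The seed aggregation referenced in the Dataset Splits row reduces to two operations: averaging learning curves across 5 seeds, and taking the mean of each seed's best score. The array shape and variable names below are illustrative assumptions; the paper does not specify its logging format.

```python
# Sketch of the paper's reported seed aggregation over 5 random seeds.
import numpy as np

returns = np.random.rand(5, 200)        # placeholder: (n_seeds, n_evaluations)

mean_curve = returns.mean(axis=0)       # "average progress across 5 seeds"
std_curve = returns.std(axis=0)         # spread across seeds, for error bands
mean_best = returns.max(axis=1).mean()  # "mean of the best performance"
print(mean_curve.shape, mean_best)
```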
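The Experiment Setup row notes that the policy MLP's input and hidden widths vary with the context length. A minimal sketch of one way to realize that is below; the flattening scheme, the `hidden_mult` factor, and the layer count are assumptions, with the actual architectures given in the paper's Appendix D.

```python
# Sketch of a policy MLP whose input and hidden widths scale with the
# sequential context length, as the setup description suggests.
import torch
import torch.nn as nn


class ContextPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, context_len, hidden_mult=64):
        super().__init__()
        in_dim = state_dim * context_len    # flattened state context
        hidden = hidden_mult * context_len  # hidden width grows with context
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state_context):
        # state_context: (batch, context_len, state_dim)
        return self.net(state_context.flatten(start_dim=1))


policy = ContextPolicy(state_dim=11, action_dim=3, context_len=2)  # Hopper-like dims
actions = policy(torch.randn(4, 2, 11))
print(actions.shape)  # torch.Size([4, 3])
```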