A Tractable Inference Perspective of Offline RL
Authors: Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, Trifle achieves 7 state-of-the-art scores and the highest average scores in 9 Gym-MuJoCo benchmarks against strong baselines. Further, Trifle significantly outperforms prior approaches in stochastic environments and safe RL tasks with minimum algorithmic modifications. |
| Researcher Affiliation | Academia | 1Institute for Artificial Intelligence, Peking University 2Computer Science Department, University of California, Los Angeles 3School of Intelligence Science and Technology, Peking University |
| Pseudocode | Yes | The main algorithm is illustrated in Algorithm 1, where we take the current state st as well as the past trajectory τ<t as input, utilize the specified value estimate fv as a heuristic to guide beam search, and output the best trajectory. (Referring to Algorithm 1) |
| Open Source Code | Yes | Our code is available at https://github.com/liebenxj/Trifle.git |
| Open Datasets | Yes | Empirically, Trifle achieves 7 state-of-the-art scores and the highest average scores in 9 Gym-MuJoCo benchmarks against strong baselines. Further, Trifle significantly outperforms prior approaches in stochastic environments and safe RL tasks with minimum algorithmic modifications. |
| Dataset Splits | No | The paper refers to 'training phase' and 'evaluation phase' for the model, but does not explicitly detail a data split for validation (e.g., percentage or count). |
| Hardware Specification | No | The paper mentions training 'on one GPU' but does not specify the make, model, or any other specific hardware details (e.g., 'It only takes 30-60 minutes (~20s per epoch, 100-200 epochs) to train a PC on one GPU'). |
| Software Dependencies | No | The paper mentions software components like 'GPTs', 'diffusion models', 'Trajectory Transformer', 'Decision Transformer', and 'BERT-like Transformer', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Beam search maintains a set of N (incomplete) sequences, each starting as an empty sequence. For ease of presentation, we assume the current time step is 0. At every time step t, beam search replicates each of the N action sequences into λ ∈ ℤ+ copies and appends an action at to every sequence. Specifically, for every partial action sequence a<t, we sample an action following p(at \| s0, a<t, E[Vt] ≥ v), where Vt can be either the single-step or the multi-step estimate depending on the task. Now that we have λ·N trajectories in total, the next step is to evaluate their expected return, which can be computed exactly using the PC (see Sec. 4.2). The N best action sequences are kept and proceed to the next time step. After repeating this procedure for H time steps, we return the best action sequence. The first action in the sequence is used to interact with the environment. Please refer to Appx. C for detailed descriptions of the algorithm. |
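The beam-search procedure quoted above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `sample_action` and `expected_return` are hypothetical placeholders standing in for sampling from p(at | s0, a<t, E[Vt] ≥ v) and for the exact expected-return computation the paper performs with the probabilistic circuit (PC).

```python
import random

def sample_action(prefix):
    # Placeholder: sample an action conditioned on the partial action
    # sequence a_<t (the paper conditions on s0, a_<t, and E[V_t] >= v).
    return random.random()

def expected_return(prefix):
    # Placeholder: expected return of an action sequence. In the paper
    # this quantity is computed exactly with the PC (Sec. 4.2).
    return sum(prefix)

def beam_search(n_beams, n_copies, horizon):
    """Maintain N sequences; each step, replicate each into lambda copies,
    score all lambda*N candidates, and keep the N best. After H steps,
    return the first action of the best sequence."""
    beams = [[] for _ in range(n_beams)]      # N empty sequences at t = 0
    for _ in range(horizon):
        candidates = []
        for prefix in beams:
            for _ in range(n_copies):         # lambda copies per sequence
                candidates.append(prefix + [sample_action(prefix)])
        # Keep the N-best candidates by expected return.
        candidates.sort(key=expected_return, reverse=True)
        beams = candidates[:n_beams]
    best = max(beams, key=expected_return)
    return best[0]                            # first action goes to the env

action = beam_search(n_beams=4, n_copies=3, horizon=5)
```

With the dummy scoring function above the search simply favors high-valued random actions; the structure of the loop (replicate, score, prune, repeat for H steps) mirrors the description in the table row.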