Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes
Authors: Jianshu Chen, Chong Wang, Lin Xiao, Ji He, Lihong Li, Li Deng
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our proposed method not only provides a viable mechanism to uncover latent patterns in decision processes, but also obtains state-of-the-art performance in these text games. In this section, we use two text games from [11] to evaluate our proposed model and demonstrate the idea of interpreting the decision making processes: (i) Saving John and (ii) Machine of Death. Table 1 summarizes the means and standard deviations of the rewards on the two games. |
| Researcher Affiliation | Industry | Microsoft Research, Redmond, WA, USA ({jianshuc,lin.xiao}@microsoft.com); Google Inc., Kirkland, WA, USA ({chongw,lihong}@google.com); Citadel LLC, Seattle/Chicago, USA ({Ji.He,Li.Deng}@citadel.com) |
| Pseudocode | Yes | Algorithm 1: The training algorithm by mirror descent back propagation. Algorithm 2: The recursive MAP inference for one episode. |
| Open Source Code | No | The paper does not explicitly state that the authors' implementation code for Q-LDA is open-sourced or provide a link to it. Footnote 5 refers to the simulators (text games) which are third-party. |
| Open Datasets | Yes | In this section, we use two text games from [11] to evaluate our proposed model and demonstrate the idea of interpreting the decision making processes: (i) Saving John and (ii) Machine of Death (see Appendix C for a brief introduction of the two games). The simulators are obtained from https://github.com/jvking/text-games |
| Dataset Splits | No | The paper describes data collection for experience replay and states that results are not evaluated on the training dataset, but it does not specify explicit train/validation/test splits or percentages for the datasets used. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For example, at each m-th experience-replay learning (see Algorithm 1), we use the softmax action selection rule [21, pp. 30-31] as the exploration policy to collect data (see Appendix E.3 for more details). We collect M = 200 episodes of data (about 3K time steps in Saving John and 16K in Machine of Death) at each of D = 20 experience replays, which amounts to a total of 4,000 episodes. At each experience replay, we update the model with 10 epochs before the next replay. (A hedged sketch of this schedule appears after the table.) |
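
The experiment-setup row above quotes a data-collection schedule: softmax action selection as the exploration policy, M = 200 episodes collected at each of D = 20 experience replays, and 10 epochs of model updates between replays. Below is a minimal Python sketch of that outer schedule only. The environment, Q-function, and update interfaces are assumptions introduced for illustration; this is not the authors' Q-LDA implementation (Algorithm 1 in the paper), which trains the model by mirror descent back propagation.

```python
"""Hedged sketch of the experience-replay schedule quoted in the table:
D = 20 replays, M = 200 episodes per replay, 10 update epochs per replay,
with softmax (Boltzmann) exploration over Q-values. All interfaces below
(env, q_function, update_model) are hypothetical, not the authors' code."""

import numpy as np


def softmax_action(q_values, temperature=1.0, rng=None):
    """Sample an action index with probability proportional to exp(Q / T)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)


def run_experience_replay_training(env, q_function, update_model,
                                   num_replays=20, episodes_per_replay=200,
                                   epochs_per_replay=10, temperature=1.0):
    """Outer training schedule: collect episodes with softmax exploration,
    then update the model for a fixed number of epochs before the next replay.

    Assumed (hypothetical) interfaces:
      env.reset() -> observation
      env.step(action) -> (next_observation, reward, done)
      q_function(observation) -> 1-D array of Q-values over feasible actions
      update_model(replay_buffer, epochs) -> None
    """
    replay_buffer = []
    for _ in range(num_replays):
        for _ in range(episodes_per_replay):
            obs, done, episode = env.reset(), False, []
            while not done:
                action = softmax_action(q_function(obs), temperature)
                next_obs, reward, done = env.step(action)
                episode.append((obs, action, reward, next_obs, done))
                obs = next_obs
            replay_buffer.append(episode)
        update_model(replay_buffer, epochs=epochs_per_replay)
    return replay_buffer
```

Sampling actions in proportion to exp(Q/T) follows the softmax action-selection rule the paper cites from [21]; a lower temperature makes the exploration policy greedier with respect to the current Q-values.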