Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes

Authors: Jianshu Chen, Chong Wang, Lin Xiao, Ji He, Lihong Li, Li Deng

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that our proposed method not only provides a viable mechanism to uncover latent patterns in decision processes, but also obtains state-of-the-art performance in these text games. In this section, we use two text games from [11] to evaluate our proposed model and demonstrate the idea of interpreting the decision making processes: (i) Saving John and (ii) Machine of Death. Table 1 summarizes the means and standard deviations of the rewards on the two games.
Researcher Affiliation | Industry | Microsoft Research, Redmond, WA, USA ({jianshuc,lin.xiao}@microsoft.com); Google Inc., Kirkland, WA, USA ({chongw,lihong}@google.com); Citadel LLC, Seattle/Chicago, USA ({Ji.He,Li.Deng}@citadel.com)
Pseudocode | Yes | Algorithm 1: the training algorithm by mirror descent back propagation. Algorithm 2: the recursive MAP inference for one episode. (A generic mirror-descent update is sketched after this table.)
Open Source Code | No | The paper does not explicitly state that the authors' implementation code for Q-LDA is open-sourced, nor does it provide a link to it. Footnote 5 refers to the simulators (text games), which are third-party.
Open Datasets | Yes | In this section, we use two text games from [11] to evaluate our proposed model and demonstrate the idea of interpreting the decision making processes: (i) Saving John and (ii) Machine of Death (see Appendix C for a brief introduction of the two games). The simulators are obtained from https://github.com/jvking/text-games
Dataset Splits | No | The paper describes data collection for experience replay and states that results are not evaluated on the training dataset, but it does not specify explicit train/validation/test splits or percentages for the datasets used.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | For example, at each m-th experience-replay learning (see Algorithm 1), we use the softmax action selection rule [21, pp. 30-31] as the exploration policy to collect data (see Appendix E.3 for more details). We collect M = 200 episodes of data (about 3K time steps in Saving John and 16K in Machine of Death) at each of D = 20 experience replays, which amounts to a total of 4,000 episodes. At each experience replay, we update the model with 10 epochs before the next replay. (A sketch of the softmax exploration rule and replay schedule follows this table.)
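
The paper describes Algorithm 1 as training by mirror descent back propagation; that algorithm is not reproduced in this report. As a rough, illustrative sketch of the kind of update mirror descent performs on simplex-constrained parameters (such as topic-word distributions in an LDA-style model), the snippet below shows a generic exponentiated-gradient step. The names phi, grad, and step_size are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def mirror_descent_step(phi, grad, step_size):
    """One mirror-descent (exponentiated-gradient) step on the probability simplex.

    phi:       parameter matrix whose columns lie on the simplex, e.g. a
               topic-word distribution of shape (vocab_size, num_topics)
    grad:      gradient of the loss with respect to phi, same shape
    step_size: positive learning rate
    """
    # Multiplicative update: a proximal step under the KL divergence
    # rather than the Euclidean distance used by ordinary gradient descent.
    updated = phi * np.exp(-step_size * grad)
    # Re-normalize each column so it remains a valid probability distribution.
    return updated / updated.sum(axis=0, keepdims=True)
```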
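
The experiment-setup row quotes the softmax action selection rule and the replay schedule (D = 20 experience replays, M = 200 episodes collected per replay, 10 training epochs per replay). Below is a minimal sketch of Boltzmann (softmax) exploration and of that collect-then-train loop; collect_episode and train_one_epoch are hypothetical hooks standing in for the game simulator and the model update, not functions from the paper.

```python
import numpy as np

def softmax_action(q_values, temperature=1.0, rng=None):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

# Hyper-parameters reported in the paper's experiment setup.
D_REPLAYS = 20          # number of experience replays
M_EPISODES = 200        # episodes collected per replay
EPOCHS_PER_REPLAY = 10  # training epochs on the buffer before the next replay

def run_training(collect_episode, train_one_epoch):
    """Alternate data collection (with softmax exploration) and model updates."""
    replay_buffer = []
    for _ in range(D_REPLAYS):
        for _ in range(M_EPISODES):
            replay_buffer.extend(collect_episode())  # one episode of transitions
        for _ in range(EPOCHS_PER_REPLAY):
            train_one_epoch(replay_buffer)           # one pass of updates on the buffer
```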