A Fine-Tuning Approach to Belief State Modeling
Authors: Samuel Sokota, Hengyuan Hu, David J Wu, J Zico Kolter, Jakob Nicolaus Foerster, Noam Brown
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We divide our experimental investigation into two parts. In the first part, we explore the extent to which BFT can improve the belief quality of a parametric model in HMMs, POMDPs, and FOSGs. In the second part, we explore the idea of performing search on top of beliefs from BFT. |
| Researcher Affiliation | Collaboration | Samuel Sokota (Carnegie Mellon University, ssokota@andrew.cmu.edu); Hengyuan Hu (Meta AI, hengyuan@fb.com); David J. Wu (Meta AI, dwu@fb.com); J. Zico Kolter (Carnegie Mellon University, zkolter@cs.cmu.edu); Jakob Foerster (Oxford University, jakob.foerster@eng.ox.ac.uk); Noam Brown (Meta AI, noambrown@fb.com) |
| Pseudocode | Yes | In Section 3.3, 'BELIEF FINE-TUNING', and Appendix A.3, 'DESCRIPTION OF ALGORITHM IMPLEMENTATION', the paper provides numbered steps for the BFT procedure, structured like pseudocode. |
| Open Source Code | Yes | The codebase for our experiments can be found at https://github.com/facebookresearch/off-belief-learning. |
| Open Datasets | Yes | We use the cooperative card game Hanabi (Bard et al., 2020) for these experiments. |
| Dataset Splits | No | The paper mentions training models but does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, or test sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and algorithms used (e.g., R2D2, Seq2Seq, RLSearch, SPARTA) but does not list specific software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | For our experimental setup, we trained policies using independent R2D2 (Kapturowski et al., 2019) that collectively score around 24 out of 25 points, using the same hyperparameters as those found in (Hu & Foerster, 2020). We then trained a Seq2Seq model (Sutskever et al., 2014)... For the Seq2Seq model, we used the same hyperparameters as those found in (Hu et al., 2021). For BFT, we fine-tuned the encoder of the belief network for 10,000 gradient steps at each decision-point using the same hyperparameters that were used for offline training. (An illustrative sketch of this per-decision-point fine-tuning loop follows the table.) |
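
To make the setup in the last row more concrete, here is a minimal, hypothetical PyTorch sketch of what fine-tuning only the belief network's encoder at a single decision point might look like. The names `belief_net`, `simulate_to_decision_point`, and the tensor shapes are illustrative assumptions, not the paper's actual implementation; the real code is in the linked off-belief-learning repository.

```python
# Hypothetical sketch of belief fine-tuning (BFT) at one decision point.
# `belief_net` is assumed to be a Seq2Seq model with `.encoder` and `.decoder`
# submodules; `simulate_to_decision_point` is a placeholder for whatever
# machinery generates (action-observation history, hidden state) pairs
# consistent with the current point of play.
import torch
import torch.nn.functional as F

def belief_fine_tune(belief_net, simulate_to_decision_point,
                     num_steps=10_000, batch_size=128, lr=1e-4):
    # Only the encoder is fine-tuned online, mirroring the setup quoted above;
    # the decoder is kept frozen.
    for p in belief_net.decoder.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(belief_net.encoder.parameters(), lr=lr)

    for _ in range(num_steps):
        # Draw fresh samples whose public information matches the current
        # decision point (placeholder interface).
        aoh_batch, hidden_state_batch = simulate_to_decision_point(batch_size)

        # Encode the action-observation history and decode a prediction of
        # the hidden state (e.g., the agent's own held cards in Hanabi).
        context = belief_net.encoder(aoh_batch)
        logits = belief_net.decoder(context)  # [batch, slots, vocab]

        # Maximum-likelihood objective on the sampled hidden states.
        loss = F.cross_entropy(
            logits.flatten(0, 1), hidden_state_batch.flatten()
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return belief_net
```

The sketch assumes the same maximum-likelihood objective used for offline training; per the quoted setup, the only change at test time is that the optimization is rerun for 10,000 gradient steps on data generated at the current decision point.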