A Fine-Tuning Approach to Belief State Modeling
Authors: Samuel Sokota, Hengyuan Hu, David J Wu, J Zico Kolter, Jakob Nicolaus Foerster, Noam Brown
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We divide our experimental investigation into two parts. In the first part, we explore the extent to which BFT can improve the belief quality of a parametric model in HMMs, POMDPs, and FOSGs. In the second part, we explore the idea of performing search on top of beliefs from BFT. |
| Researcher Affiliation | Collaboration | Samuel Sokota (Carnegie Mellon University, ssokota@andrew.cmu.edu); Hengyuan Hu (Meta AI, hengyuan@fb.com); David J. Wu (Meta AI, dwu@fb.com); J. Zico Kolter (Carnegie Mellon University, zkolter@cs.cmu.edu); Jakob Foerster (Oxford University, jakob.foerster@eng.ox.ac.uk); Noam Brown (Meta AI, noambrown@fb.com) |
| Pseudocode | Yes | In Section 3.3, 'BELIEF FINE-TUNING', and Appendix A.3, 'DESCRIPTION OF ALGORITHM IMPLEMENTATION', the paper provides numbered steps for the BFT procedure, structured like pseudocode. |
| Open Source Code | Yes | The codebase for our experiments can be found at https://github.com/facebookresearch/off-belief-learning. |
| Open Datasets | Yes | We use the cooperative card game Hanabi (Bard et al., 2020) for these experiments. |
| Dataset Splits | No | The paper mentions training models but does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, or test sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and algorithms used (e.g., R2D2, Seq2Seq, RLSearch, SPARTA) but does not list specific software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | For our experimental setup, we trained policies using independent R2D2 (Kapturowski et al., 2019) that collectively score around 24 out of 25 points, using the same hyperparameters as those found in (Hu & Foerster, 2020). We then trained a Seq2Seq model (Sutskever et al., 2014)... For the Seq2Seq model, we used the same hyperparameters as those found in (Hu et al., 2021). For BFT, we fine-tuned the encoder of the belief network for 10,000 gradient steps at each decision-point using the same hyperparameters that were used for offline training. (An illustrative sketch of this per-decision-point fine-tuning loop follows the table.) |
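
To make the setup in the last row more concrete, here is a minimal, hypothetical PyTorch sketch of what fine-tuning only the belief network's encoder at a single decision point might look like. The names `belief_net`, `simulate_to_decision_point`, and the tensor shapes are illustrative assumptions, not the paper's actual implementation; the real code is in the linked off-belief-learning repository.

```python
# Hypothetical sketch of belief fine-tuning (BFT) at one decision point.
# `belief_net` is assumed to be a Seq2Seq model with `.encoder` and `.decoder`
# submodules; `simulate_to_decision_point` is a placeholder for whatever
# machinery generates (action-observation history, hidden state) pairs
# consistent with the current point of play.
import torch
import torch.nn.functional as F

def belief_fine_tune(belief_net, simulate_to_decision_point,
                     num_steps=10_000, batch_size=128, lr=1e-4):
    # Only the encoder is fine-tuned online, mirroring the setup quoted above;
    # the decoder is kept frozen.
    for p in belief_net.decoder.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(belief_net.encoder.parameters(), lr=lr)

    for _ in range(num_steps):
        # Draw fresh samples whose public information matches the current
        # decision point (placeholder interface).
        aoh_batch, hidden_state_batch = simulate_to_decision_point(batch_size)

        # Encode the action-observation history and decode a prediction of
        # the hidden state (e.g., the agent's own held cards in Hanabi).
        context = belief_net.encoder(aoh_batch)
        logits = belief_net.decoder(context)  # [batch, slots, vocab]

        # Maximum-likelihood objective on the sampled hidden states.
        loss = F.cross_entropy(
            logits.flatten(0, 1), hidden_state_batch.flatten()
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return belief_net
```

The sketch assumes the same maximum-likelihood objective used for offline training; per the quoted setup, the only change at test time is that the optimization is rerun for 10,000 gradient steps on data generated at the current decision point.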