Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Predicting Decisions in Language Based Persuasion Games

Authors: Reut Apel, Ido Erev, Roi Reichart, Moshe Tennenholtz

JAIR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental For this purpose, we conduct an online repeated interaction experiment. At each trial of the interaction, an informed expert aims to sell an uninformed decision-maker a vacation in a hotel, by sending her a review that describes the hotel. While the expert is exposed to several scored reviews, the decision-maker observes only the single review sent by the expert, and her payoff in case she chooses to take the hotel is a random draw from the review score distribution available to the expert only. The expert's payoff, in turn, depends on the number of times the decision-maker chooses the hotel. We also compare the behavioral patterns in this experiment to the equivalent patterns in similar experiments where the communication is based on the numerical values of the reviews rather than the reviews' text, and observe substantial differences which can be explained through an equilibrium analysis of the game. We consider a number of modeling approaches for our verbal communication setup, differing from each other in the model type (deep neural network (DNN) vs. linear classifier), the type of features used by the model (textual, behavioral or both) and the source of the textual features (DNN-based vs. hand-crafted). Our results demonstrate that given a prefix of the interaction sequence, our models can predict the future decisions of the decision-maker, particularly when a sequential modeling approach and hand-crafted textual features are applied. Further analysis of the hand-crafted textual features allows us to make initial observations about the aspects of text that drive decision making in our setup.
Researcher Affiliation Academia Reut Apel EMAIL Ido Erev EMAIL Roi Reichart EMAIL Moshe Tennenholtz EMAIL Faculty of Industrial Engineering and Management Technion Israel Institute of Technology, Israel
Pseudocode No The paper describes its methods and architectures through detailed textual explanations and diagrams (Figures 8 and 9), but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes 1. Our code and data are available at: https://github.com/reutapel/Predicting-Decisions-in-Language-Based-Persuasion-Games
Open Datasets Yes 1. Our code and data are available at: https://github.com/reutapel/Predicting-Decisions-in-Language-Based-Persuasion-Games
Dataset Splits Yes Recall from Section 4.1 that we have collected a train-validation set of 408 examples and a separate test set of 101 examples. Each example in each set consists of a ten-trial game, and the sets differ in their hotel (and hence also review) sets. We break each example in each of the sets into 10 different examples, such that the first pr ∈ {0, 1, . . . , 9} trials serve as a prefix and the remaining sf = 10 − pr trials serve as a suffix. This yields a total of 4080 training examples and 1010 test examples. We employ a six-fold cross-validation protocol in order to tune the hyper-parameters of each model. For this purpose, we split the 408 (expert, decision-maker) pairs of the train-validation set into six subsets, such that each subset consists of 68 pairs. As described above, each decision sequence is translated into ten examples, each with a different prefix size, resulting in 680 examples in each subset. In each fold, we select one subset for development and the remaining five subsets serve for training.
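The split arithmetic quoted above can be sketched as follows. This is a minimal illustration with placeholder game data; `expand_game` is a hypothetical helper written for this note, not code from the paper's repository:

```python
# Each ten-trial game is expanded into ten (prefix, suffix) examples,
# one per prefix size pr in {0, ..., 9}; the suffix has 10 - pr trials.
def expand_game(game, n_trials=10):
    return [(game[:pr], game[pr:]) for pr in range(n_trials)]

games = [list(range(10)) for _ in range(408)]  # 408 train-validation games
examples = [ex for g in games for ex in expand_game(g)]
assert len(examples) == 4080                   # 408 games x 10 prefixes

# Six-fold cross-validation over the 408 (expert, decision-maker) pairs:
# 68 pairs per fold, i.e. 680 expanded examples per fold.
fold_size = 408 // 6
assert fold_size == 68 and fold_size * 10 == 680
```

The same expansion applied to the 101 test games yields the 1010 test examples reported in the quote.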
Hardware Specification No The paper mentions training DNNs and using PyTorch, which typically implies the use of GPUs, but it does not provide any specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies No The paper mentions using the 'sklearn package', 'AllenNLP software package', and 'PyTorch' for implementation, as well as the 'ADAM optimization algorithm' and 'Hugging Face PyTorch Pretrained BERT GitHub repository'. However, specific version numbers for these software dependencies are not provided in the text.
Experiment Setup Yes For all DNNs, we use ReLU as the activation function for all internal layers, and we tune the dropout parameter (0.0, 0.1, 0.2, 0.3), such that the same dropout parameter was used in the LSTM and Transformer models, as well as in the linear layers placed on top of these models. Training is carried out for 100 epochs with early stopping, and a batch size of 10 in the LSTM-based models and 9 in the Transformer-based models. Each batch consisted of all the examples of one decision-maker. We use a different batch size for each model, since we did not feed the Transformer with examples with prefix of size 0, as mentioned in Section 6.2, and we still want to have examples of only one decision-maker in each batch. We use the ADAM optimization algorithm (Kingma & Ba, 2015) with its default parameters as implemented in PyTorch: learning rate = 1e-03, fuzz factor ϵ = 1e-08, and learning rate decay over each update = 0.0.
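The quoted ADAM defaults can be made concrete with a minimal pure-Python sketch of a single update step. Only the learning rate (1e-03), fuzz factor ϵ (1e-08), and zero decay are quoted from the paper; β₁ = 0.9 and β₂ = 0.999 are PyTorch's standard defaults, assumed here:

```python
# One ADAM update step for a scalar parameter. lr and eps match the
# paper's quoted PyTorch defaults; beta1/beta2 are assumed standard values.
def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * state["v"] + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                        # bias correction
    v_hat = v / (1 - beta2 ** t)
    new_param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return new_param, {"t": t, "m": m, "v": v}

p, st = 1.0, {"t": 0, "m": 0.0, "v": 0.0}
p, st = adam_step(p, 1.0, st)  # first step moves the parameter by ~lr
```

After the first step the bias-corrected moments both equal 1, so the parameter moves by approximately the learning rate, illustrating why 1e-03 sets the initial step scale.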