Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Predicting Decisions in Language Based Persuasion Games

Authors: Reut Apel, Ido Erev, Roi Reichart, Moshe Tennenholtz

JAIR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental For this purpose, we conduct an online repeated interaction experiment. At each trial of the interaction, an informed expert aims to sell an uninformed decision-maker a vacation in a hotel, by sending her a review that describes the hotel. While the expert is exposed to several scored reviews, the decision-maker observes only the single review sent by the expert, and her payoff in case she chooses to take the hotel is a random draw from the review score distribution available to the expert only. The expert's payoff, in turn, depends on the number of times the decision-maker chooses the hotel. We also compare the behavioral patterns in this experiment to the equivalent patterns in similar experiments where the communication is based on the numerical values of the reviews rather than the reviews' text, and observe substantial differences which can be explained through an equilibrium analysis of the game. We consider a number of modeling approaches for our verbal communication setup, differing from each other in the model type (deep neural network (DNN) vs. linear classifier), the type of features used by the model (textual, behavioral or both) and the source of the textual features (DNN-based vs. hand-crafted). Our results demonstrate that given a prefix of the interaction sequence, our models can predict the future decisions of the decision-maker, particularly when a sequential modeling approach and hand-crafted textual features are applied. Further analysis of the hand-crafted textual features allows us to make initial observations about the aspects of text that drive decision making in our setup.
Researcher Affiliation Academia Reut Apel EMAIL Ido Erev EMAIL Roi Reichart EMAIL Moshe Tennenholtz EMAIL Faculty of Industrial Engineering and Management Technion Israel Institute of Technology, Israel
Pseudocode No The paper describes its methods and architectures through detailed textual explanations and diagrams (Figures 8 and 9), but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes 1. Our code and data are available at: https://github.com/reutapel/Predicting-Decisions-in-Language-Based-Persuasion-Games
Open Datasets Yes 1. Our code and data are available at: https://github.com/reutapel/Predicting-Decisions-in-Language-Based-Persuasion-Games
Dataset Splits Yes Recall from Section 4.1 that we have collected a train-validation set of 408 examples and a separate test set of 101 examples. Each example in each set consists of a ten-trial game, and the sets differ in their hotel (and hence also review) sets. We break each example in each of the sets into 10 different examples, such that the first pr ∈ {0, 1, . . . , 9} trials serve as a prefix and the remaining sf = 10 − pr trials serve as a suffix. This yields a total of 4080 training examples and 1010 test examples. We employ a six-fold cross-validation protocol in order to tune the hyper-parameters of each model. For this purpose, we split the 408 (expert, decision-maker) pairs of the train-validation set into six subsets, such that each subset consists of 68 pairs. As described above, each decision sequence is translated into ten examples, each with a different prefix size, resulting in 680 examples in each subset. In each fold, we select one subset for development and the remaining five subsets serve for training.
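The split arithmetic quoted above can be sketched as follows. This is a minimal illustration with placeholder game data; `expand_game` is a hypothetical helper written for this note, not code from the paper's repository:

```python
# Each ten-trial game is expanded into ten (prefix, suffix) examples,
# one per prefix size pr in {0, ..., 9}; the suffix has 10 - pr trials.
def expand_game(game, n_trials=10):
    return [(game[:pr], game[pr:]) for pr in range(n_trials)]

games = [list(range(10)) for _ in range(408)]  # 408 train-validation games
examples = [ex for g in games for ex in expand_game(g)]
assert len(examples) == 4080                   # 408 games x 10 prefixes

# Six-fold cross-validation over the 408 (expert, decision-maker) pairs:
# 68 pairs per fold, i.e. 680 expanded examples per fold.
fold_size = 408 // 6
assert fold_size == 68 and fold_size * 10 == 680
```

The same expansion applied to the 101 test games yields the 1010 test examples reported in the quote.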
Hardware Specification No The paper mentions training DNNs and using PyTorch, which typically implies the use of GPUs, but it does not provide any specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies No The paper mentions using the 'sklearn package', 'AllenNLP software package', and 'PyTorch' for implementation, as well as the 'ADAM optimization algorithm' and 'Hugging Face PyTorch Pretrained BERT GitHub repository'. However, specific version numbers for these software dependencies are not provided in the text.
Experiment Setup Yes For all DNNs, we use ReLU as the activation function for all internal layers, and we tune the dropout parameter (0.0, 0.1, 0.2, 0.3), such that the same dropout parameter was used in the LSTM and Transformer models, as well as in the linear layers placed on top of these models. Training is carried out for 100 epochs with early stopping, and a batch size of 10 in the LSTM-based models and 9 in the Transformer-based models. Each batch consisted of all the examples of one decision-maker. We use a different batch size for each model, since we did not feed the Transformer with examples with prefix of size 0, as mentioned in Section 6.2, and we still want to have examples of only one decision-maker in each batch. We use the ADAM optimization algorithm (Kingma & Ba, 2015) with its default parameters as implemented in PyTorch: learning rate = 1e-03, fuzz factor ϵ = 1e-08, and learning rate decay over each update = 0.0.
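The quoted ADAM defaults can be made concrete with a minimal pure-Python sketch of a single update step. Only the learning rate (1e-03), fuzz factor ϵ (1e-08), and zero decay are quoted from the paper; β₁ = 0.9 and β₂ = 0.999 are PyTorch's standard defaults, assumed here:

```python
# One ADAM update step for a scalar parameter. lr and eps match the
# paper's quoted PyTorch defaults; beta1/beta2 are assumed standard values.
def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * state["v"] + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                        # bias correction
    v_hat = v / (1 - beta2 ** t)
    new_param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return new_param, {"t": t, "m": m, "v": v}

p, st = 1.0, {"t": 0, "m": 0.0, "v": 0.0}
p, st = adam_step(p, 1.0, st)  # first step moves the parameter by ~lr
```

After the first step the bias-corrected moments both equal 1, so the parameter moves by approximately the learning rate, illustrating why 1e-03 sets the initial step scale.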