Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CtD: Composition through Decomposition in Emergent Communication
Authors: Boaz Carmeli, Ron Meir, Yonatan Belinkov
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method, termed Composition through Decomposition, involves two sequential training steps. In the Decompose step, the agents learn to decompose an image into basic concepts using a codebook acquired during interaction in a multi-target coordination game. Subsequently, in the Compose step, the agents employ this codebook to describe novel images by composing basic concepts into complex phrases. Remarkably, we observe cases where generalization in the Compose step is achieved zero-shot, without the need for additional training. |
| Researcher Affiliation | Academia | Boaz Carmeli, Technion - Israel Institute of Technology; Ron Meir, Technion - Israel Institute of Technology; Yonatan Belinkov, Technion - Israel Institute of Technology |
| Pseudocode | No | The paper describes the methodology through prose and figures (e.g., Figure 1 and Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are provided. |
| Open Source Code | No | All datasets used in this work are either publicly available or can be easily generated using Python. A detailed description of each dataset is provided in Appendix B. Appendix A.3 offers a comprehensive explanation of the evaluation metrics, with Python implementations available online. Please refer to the respective papers for additional details. Appendix E details the codebook configuration and its hyperparameters. General training procedures and hyperparameter settings are outlined in Appendix G. All experiments were conducted on a single A100 GPU with 40 GB of RAM, and each experiment was completed in less than a day. The code for data preprocessing and all experiments will be made publicly available after the paper is accepted. |
| Open Datasets | Yes | Datasets We use five datasets in our experiments summarized in Table 1. THING (Carmeli et al., 2024) is a synthetic dataset... SHAPE (Kuhnle & Copestake, 2017) is a visual reasoning dataset... MNIST (Le Cun et al., 1998) is a dataset of handwritten digits... COCO (Lin et al., 2014a) is a dataset of real-world multi-object images... QRC is a synthetic image dataset introduced by us, employing two-dimensional QR codes to encode information akin to the THING dataset. We use this dataset to evaluate results on a non-compositional dataset. We provide more information on each of these datasets in Appendix B. |
| Dataset Splits | Yes | In the SINGLE-CONCEPT variant we split the data randomly into training, validation, and test sets. This approach aligns with our belief that agents cannot generalize across concepts; for example, learning the concepts Red and Triangle does not enable the agents to learn the concept Square. However, in the COMPOSITE-PHRASE variant, we ensure that there is no overlap between phrase labels in each split. Specifically, if a labeling phrase such as Red:Square appears in the test set, it does not appear in the training set. This segregation allows us to evaluate the extent to which agents have learned to generalize by composing known concepts in novel ways. In all our experiments we use 30,000 samples for training, 1000 for validation and 1000 for test. |
| Hardware Specification | Yes | All experiments were conducted on a single A100 GPU with 40 GB of RAM, and each experiment was completed in less than a day. |
| Software Dependencies | No | We run all experiments over a modified version of the Egg framework (Kharitonov et al., 2019). In our version the communication modules, zθ and zρ, used by the sender and the receiver, respectively, are totally separated from their perceptual modules, which we term agents. See schematic illustration in Figure 2. This separation allows us to use the exact same communication modules for different games. The Gumbel-Softmax (GS), Quantized (QT), and codebook-based (CB) protocols are implemented at the communication layer. For GS we use an implementation provided by the Egg framework, where we do not allow temperature to be learned, and set the straight-through estimator to False. For the QT protocol we followed parameter recommendations from Carmeli et al. (2023) and use binary quantization in all experiments. For CB we adapted the implementation code provided by Zheng & Vedaldi (2023) and code from Oord et al. (2018). Both GS and QT use a recurrent neural network (RNN) for generating multiple words within a message. Refer to Table 11 for details on specific RNN and other hyper-parameters. The paper mentions software tools such as the Egg framework and Python implementations of the evaluation metrics, but it does not specify version numbers for these or for other key libraries (e.g., PyTorch, CUDA) that would be essential for replication. |
| Experiment Setup | Yes | Table 11: Agent hyper-parameters used for obtaining the results in Table 4. Batch Size: 10; lr: 0.0005; Sender Targets: 20; Sender Distr: 0; Receiver Targets: 20; Receiver Distr: 20; Cell Type: LSTM; Sender Hidden: 100; Sender Embed: 500; Receiver Hidden: 100; Receiver Embed: 500. |
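
The COMPOSITE-PHRASE split quoted above requires that phrase labels (e.g. `Red:Square`) never overlap between training, validation, and test sets. A minimal sketch of one way to implement such a phrase-disjoint split; the function name, fractions, and sampling details are illustrative assumptions, not taken from the paper:

```python
import random

def phrase_disjoint_split(samples, train_frac=0.9, val_frac=0.05, seed=0):
    """Split (image_id, phrase) pairs by unique phrase label, so that a
    phrase appearing in the test set never appears in training.
    NOTE: hypothetical helper, not the paper's actual code."""
    phrases = sorted({p for _, p in samples})
    rng = random.Random(seed)
    rng.shuffle(phrases)
    n_train = int(len(phrases) * train_frac)
    n_val = int(len(phrases) * val_frac)
    train_p = set(phrases[:n_train])
    val_p = set(phrases[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}
    for sample in samples:
        _, phrase = sample
        if phrase in train_p:
            split["train"].append(sample)
        elif phrase in val_p:
            split["val"].append(sample)
        else:
            split["test"].append(sample)
    return split
```

Splitting by phrase rather than by sample is what lets the test set probe compositional generalization: every test phrase is a novel combination of concepts the agents saw only separately during training.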
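
The Software Dependencies entry notes that the GS protocol uses Gumbel-Softmax with a fixed (non-learned) temperature and the straight-through estimator disabled, so messages remain soft distributions over words. A minimal NumPy sketch of the Gumbel-Softmax relaxation itself, not the Egg framework's implementation; shapes and the function name are illustrative:

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Relaxed one-hot sampling: perturb logits with Gumbel noise, then
    softmax at the given temperature (no straight-through hardening,
    matching the setting described in the quote above)."""
    rng = np.random.default_rng(rng)
    # Gumbel(0, 1) noise: g = -log(-log(u)) for u ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # stabilize the softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```

With `logits` of shape `(message_length, vocab_size)`, each row is a soft word; lowering the temperature pushes rows toward one-hot vectors while keeping the sampling differentiable in the logits.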