Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mixture of Latent Experts Using Tensor Products

Authors: Zhan Su, Fengran Mo, Prayag Tiwari, Benyou Wang, Qiuchi Li, Jian-Yun Nie, Jakob Grue Simonsen

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate the effectiveness and parameter efficiency of our approach, we apply our methods against a series of competitive baselines on T0 (Sanh et al., 2021), a widely used benchmark in multi-task transfer learning covering a wide variety of language understanding tasks. Our experiments reveal several key insights: 1) all latent-expert approaches surpass the corresponding dense approaches, highlighting the potential of modular language models to mitigate negative interference in multi-task learning and deliver superior outcomes; 2) Tensor Poly-I achieves higher parameter efficiency in adaptation and outperforms other modular LMs, which shows the potential of our approach in multi-task transfer learning.
Researcher Affiliation Academia Zhan Su (University of Copenhagen, Denmark); Fengran Mo (University of Montreal, Quebec, Canada); Prayag Tiwari (School of Information Technology, Halmstad University, Sweden); Benyou Wang (The Chinese University of Hong Kong, Shenzhen, China); Qiuchi Li (University of Copenhagen, Denmark); Jian-Yun Nie (University of Montreal, Quebec, Canada); Jakob Grue Simonsen (University of Copenhagen, Denmark)
Pseudocode No The paper describes methods using mathematical formulations and descriptive text, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code Yes The code is released: https://github.com/microsoft/mttl
Open Datasets Yes To evaluate the effectiveness of our approaches, we perform experiments on the T0 benchmark (Sanh et al., 2021), a multi-task transfer learning benchmark widely used in few-shot generalization approaches. This benchmark encompasses a diverse array of tasks, including sentence completion (COPA (Roemmele et al., 2011), H-SWAG (Zellers et al., 2019) and Story Cloze (Sharma et al., 2018) datasets), natural language inference (ANLI (Nie et al., 2019), CB (De Marneffe et al., 2019) and RTE (Dagan et al., 2005)), coreference resolution (WSC (Levesque et al., 2012), Winogrande (Sakaguchi et al., 2021)), and word sense disambiguation (WIC (Pilehvar & Camacho-Collados, 2018)).
Dataset Splits Yes For each task, our evaluation strategy involves constructing sets of five few-shot training examples, which are generated by sampling subsets from each dataset using different seeds. We then report the median performance. Note that the prompt examples for each dataset are constructed using the prompt templates from P3 (Bach et al., 2022). Datasets: To evaluate the generalization capabilities of our models, we adopt the same benchmarking strategy as Liu et al. (2022), utilizing a subset of tasks designated as held-out from the multitask training.
Hardware Specification No The paper does not explicitly mention any specific hardware used for running the experiments, such as GPU models, CPU types, or cloud resources with specifications.
Software Dependencies No The paper mentions the use of the T0-3B model and T5 (Raffel et al., 2020) as a foundation model, but it does not provide specific version numbers for any software libraries, frameworks, or programming languages used in their implementation.
Experiment Setup Yes To facilitate a fair comparison with baseline methodologies, we have chosen the T0-3B model, consistent with the approach described in the IA3 paper by Liu et al. (2022). In the Full FT scenario, we do not freeze any parameters of the pre-trained model, nor do we insert any adapters, allowing for a comprehensive update of the model's parameters during fine-tuning. For each task, our evaluation strategy involves constructing sets of five few-shot training examples, which are generated by sampling subsets from each dataset using different seeds. We then report the median performance. We set the order N = 2 for this comparison.
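The seed-sampled few-shot evaluation protocol quoted above (sample a few-shot training subset per seed, then report the median score across seeds) can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `sample_few_shot`, `median_over_seeds`, the toy dataset, and the stand-in `toy_eval` metric are all hypothetical names.

```python
import random
import statistics

def sample_few_shot(dataset, k, seed):
    """Draw a k-example few-shot training set from a dataset with a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)

def median_over_seeds(dataset, seeds, k, evaluate):
    """Evaluate once per seed-sampled subset and report the median score,
    mirroring the 'report the median performance' protocol."""
    scores = [evaluate(sample_few_shot(dataset, k, s)) for s in seeds]
    return statistics.median(scores)

# Toy usage: a fake dataset of 100 items and a placeholder metric in [0, 1].
dataset = list(range(100))
toy_eval = lambda subset: sum(subset) / (100 * len(subset))  # stand-in, not a real accuracy
score = median_over_seeds(dataset, seeds=[0, 1, 2, 3, 4], k=5, evaluate=toy_eval)
print(round(score, 3))
```

Seeding each subset with its own `random.Random(seed)` makes every few-shot split reproducible independently of any global random state, which is what allows the median-over-seeds number to be recomputed exactly.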