Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Leveraging semantic similarity for experimentation with AI-generated treatments

Authors: Lei Shi, David Arbour, Raghavendra Addanki, Ritwik Sinha, Avi Feller

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we conduct semi-synthetic experiments based on three open-source datasets. (i) The Upworthy Research Archive (Matias et al., 2021) is an open dataset of thousands of A/B tests of headlines conducted by Upworthy from January 2013 to April 2015. We use the Exploratory Dataset in The Upworthy Research Archive, which contains headlines and metrics (clicks, impressions, click-through rates, or CTR) from thousands of experiments.
Researcher Affiliation	Collaboration	Lei Shi, University of California, Berkeley, Berkeley, CA 94720; David Arbour, Adobe Research, San Jose, CA 95110; Raghavendra Addanki, Adobe Research, San Jose, CA 95110; Ritwik Sinha, Adobe Research, San Jose, CA 95110; Avi Feller, University of California, Berkeley, Berkeley, CA 94720
Pseudocode	Yes	Algorithm 1 Double Kernel Learning via Alternating Projection
Open Source Code	Yes	The code for the algorithm and replication of the numerical experiments can be found here: https://github.com/LeiShi-rocks/DKRL-LLM.
Open Datasets	Yes	In this section, we conduct semi-synthetic experiments based on three open-source datasets. (i) The Upworthy Research Archive (Matias et al., 2021) is an open dataset... (ii) The MIND dataset (Wu et al., 2020) is a benchmark dataset... (iii) The ASOS E-Commerce Dataset is an open-source A/B testing dataset from ASOS in the fashion industry. The data is available at https://huggingface.co/datasets/TrainingDataPro/asos-e-commerce-dataset
Dataset Splits	Yes	We split the synthetic dataset into a training set and test set, fit different methods on the training data, and evaluate the fitted results on the test dataset; we repeat this procedure across multiple ranks.
Hardware Specification	Yes	The low computational cost (measured on a Macbook Pro with a M2 Max chip) of kernel-based methods also enables more rapid iteration for large-scale A/B testing.
Software Dependencies	No	The paper mentions software components such as "sentence transformer MiniLM (Wang et al., 2020)" and "GPT-4o", but does not provide specific version numbers for these or for other software libraries or environments used in the experimental setup.
Experiment Setup	Yes	Upworthy experimental setup. We choose $|Z| = 50$ headlines as candidate treatments in a hypothetical experiment. We use the sentence transformer MiniLM (Wang et al., 2020) to encode the sampled headlines into sentence embeddings of dimension $p = 384$. Since user-level covariates $x$ are not available in the Upworthy Dataset, we simulate $n = 500$ Gaussian vectors of dimension $q = 200$ as baseline covariates. The outcome of interest is the potential revenue that each user can contribute. When a user with covariate $x$ views the headline $z$, we model the average (centered) potential revenue generated by this particular user as $f(z, x) = z^\top \Theta x + \epsilon$, where $\Theta \in \mathbb{R}^{p \times q}$ is a matrix with varying ranks and $\epsilon$ is additive noise.
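
The semi-synthetic outcome model quoted under Experiment Setup can be illustrated with a minimal NumPy sketch. It is not the authors' code. Several choices here are assumptions made only for illustration: the function name `simulate_outcomes`, the use of random Gaussian vectors as a stand-in for the MiniLM headline embeddings, the low-rank construction of Θ as a product of two Gaussian factors, the scaling of Θ, and Gaussian noise with standard deviation `noise_sd`. Only the dimensions (50 treatments, n = 500, p = 384, q = 200) and the bilinear form of the outcome come from the quoted setup.

```python
import numpy as np


def simulate_outcomes(n_users=500, n_treatments=50, p=384, q=200,
                      rank=5, noise_sd=1.0, seed=0):
    """Simulate outcomes f(z, x) = z^T Theta x + eps for every (treatment, user) pair.

    Hypothetical helper: the embeddings, the construction of Theta, and the
    noise distribution are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Assumption: random vectors stand in for the p-dimensional MiniLM
    # sentence embeddings of the sampled headlines.
    Z = rng.standard_normal((n_treatments, p))

    # Simulated Gaussian baseline covariates, one q-dimensional row per user.
    X = rng.standard_normal((n_users, q))

    # Assumption: a rank-`rank` Theta built as U V^T, rescaled so that the
    # signal has moderate magnitude.
    U = rng.standard_normal((p, rank))
    V = rng.standard_normal((q, rank))
    Theta = (U @ V.T) / np.sqrt(p * q)

    # Bilinear signal z^T Theta x for all pairs: shape (n_treatments, n_users).
    signal = Z @ Theta @ X.T

    # Assumption: additive Gaussian noise.
    Y = signal + noise_sd * rng.standard_normal(signal.shape)
    return Z, X, Theta, Y


if __name__ == "__main__":
    Z, X, Theta, Y = simulate_outcomes(rank=3)
    print(Y.shape, np.linalg.matrix_rank(Theta))
```

Varying the `rank` argument mimics the "matrix with varying ranks" in the quoted setup; real headline embeddings would replace `Z` in an actual replication.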