Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations

Authors: Faisal Hamman, Pasan Dissanayake, Yanjun Fu, Sanghamitra Dutta

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform experiments across various datasets and LLMs to show that COD outperforms standard distillation approaches in few-shot regimes (as low as 8-512 samples).
Researcher Affiliation	Academia	Faisal Hamman, Pasan Dissanayake, Yanjun Fu, Sanghamitra Dutta; University of Maryland, College Park; EMAIL
Pseudocode	Yes	Algorithm 1 COD: CFE-infused Distillation. Require: Teacher $g_t$, student $g_s$, dataset $D_k = \{(x_i, y_i)\}_{i=1}^{k}$, CFGen, learning rate $\eta$, loss weights $\alpha$ (KD), $\beta$ (LWD), epochs $E$
Open Source Code	Yes	Our code is available at https://github.com/FaisalHamman/CoD.
Open Datasets	Yes	Datasets. We evaluate COD across six text classification benchmarks that span a range of domains. SST2 is a binary sentiment classification task derived from movie review snippets [65]. Sentiment140 consists of tweets labeled as positive or negative, reflecting user sentiment in short social media posts [66]. IMDB is a binary sentiment classification dataset containing full-length movie reviews [67]. CoLA (Corpus of Linguistic Acceptability) is a grammaticality judgment task that requires the model to identify whether a sentence is linguistically acceptable [68]. Amazon Polarity contains customer reviews labeled as positive or negative sentiment [69]. Yelp is another sentiment classification dataset based on user-generated restaurant reviews [70].
Dataset Splits	Yes	Yelp [70]: We use the Yelp Review Full dataset, filtering for reviews with at most 250 tokens and discarding neutral labels. Labels are binarized: 1-2 as negative and 4-5 as positive. The processed dataset contains 106,624 training examples, 1,000 for validation, and 7,074 for testing, with a slightly imbalanced class distribution (64% negative).
Hardware Specification	Yes	All experiments are conducted on a server equipped with four NVIDIA RTX A6000 GPUs.
Software Dependencies	No	The paper mentions 'FP16' and 'Adam' optimizer but does not provide specific version numbers for software like Python, PyTorch, or CUDA.
Experiment Setup	Yes	We fine-tune the teacher model using DeBERTaV3-base, initialized with a classification head for each target task. For the teacher, we use a dropout rate of 0.1, linear learning rate decay, and train for 8 epochs with a fixed learning rate of $2 \times 10^{-5}$ and batch sizes of {32, 64}. Optimization is performed using Adam with $\epsilon = 1 \times 10^{-6}$, $\beta_1 = 0.9$, and $\beta_2 = 0.98$, without weight decay. Mixed-precision training with FP16 is used throughout. For distillation, the student is initialized from a pre-trained DeBERTa-v3-small or DeBERTa-v3-xsmall model. We search learning rates in the range $[1 \times 10^{-5}, 5 \times 10^{-5}]$, and use a fixed batch size of 8 in our few-shot experiments. All student models are trained for 10 epochs using Adam with the same optimizer settings as the teacher. For KD and LWD baselines, we set the distillation loss weight to 20.
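
For readers attempting a reproduction, the Dataset Splits and Experiment Setup rows can be written down as a small Python sketch. It is not taken from the authors' repository. The function and class names are hypothetical, the label ids (0 = negative, 1 = positive) and the 1-to-5 star encoding are assumptions, and whitespace splitting stands in for the paper's tokenizer, which the table does not name. The numeric values follow our reading of the two rows (for example, a student learning-rate range of 1e-5 to 5e-5).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

MAX_TOKENS = 250  # length cap stated in the Dataset Splits row


def binarize_yelp_label(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary label, or None for neutral (3).

    Assumption: 0 = negative, 1 = positive; the table does not give the ids.
    """
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    if stars == 3:
        return None
    raise ValueError(f"expected a star rating in 1..5, got {stars!r}")


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def preprocess_yelp(
    rows: Iterable[Tuple[str, int]],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Iterator[Tuple[str, int]]:
    """Drop neutral reviews and reviews longer than MAX_TOKENS; binarize the rest."""
    for text, stars in rows:
        label = binarize_yelp_label(stars)
        if label is None or count_tokens(text) > MAX_TOKENS:
            continue
        yield text, label


@dataclass(frozen=True)
class AdamSettings:
    """Adam settings from the Experiment Setup row (shared by teacher and student)."""
    eps: float = 1e-6
    beta1: float = 0.9
    beta2: float = 0.98
    weight_decay: float = 0.0


@dataclass(frozen=True)
class StudentSetup:
    """Few-shot student settings as we read them from the Experiment Setup row."""
    lr_range: Tuple[float, float] = (1e-5, 5e-5)  # searched range
    batch_size: int = 8
    epochs: int = 10
    distill_loss_weight: float = 20.0  # stated for the KD and LWD baselines
    optimizer: AdamSettings = AdamSettings()


if __name__ == "__main__":
    sample = [("terrible service", 1), ("it was fine", 3), ("loved it", 5)]
    print(list(preprocess_yelp(sample)))
    print(StudentSetup())
```

Passing a different `count_tokens` callable (for instance, one backed by the tokenizer actually used for the student model) changes which reviews survive the 250-token filter, so the split sizes reported in the table should not be expected from the whitespace stand-in.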