Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning Task-Agnostic Representations through Multi-Teacher Distillation
Authors: Philippe Formont, Maxime Darrin, Banafsheh Karimian, Eric Granger, Jackie CK Cheung, Ismail Ben Ayed, Mohammadhadi Shateri, Pablo Piantanida
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations across text, vision models, and molecular modeling show that our method effectively leverages teacher diversity, resulting in representations enabling better performance for a wide range of downstream tasks such as classification, clustering, or regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance in various modalities. |
| Researcher Affiliation | Academia | Philippe Formont (Universite Paris-Saclay, ETS Montreal, Mila Quebec AI Institute, LIVIA, ILLS); Maxime Darrin (McGill University, Universite Paris-Saclay, Mila Quebec AI Institute, ILLS); Banafsheh Karimian (ETS Montreal, ILLS, LIVIA); Jackie CK Cheung (McGill University, Mila Quebec AI Institute); Eric Granger (ETS Montreal, ILLS, LIVIA); Ismail Ben Ayed (ETS Montreal, ILLS, LIVIA); Mohammadhadi Shateri (ETS Montreal, LIVIA); Pablo Piantanida (CNRS, CentraleSupelec, Universite Paris-Saclay, ILLS, Mila Quebec AI Institute) |
| Pseudocode | Yes | Algorithm 1: Distillation through Gaussian Kernels. Input: dataset $D = \{x_i\}$, embedders $(T_k)_{1 \le k \le K}$, student embedder $S$, number of iterations $T$, learning rate $\eta$. Initialize the parameters $\theta_s$ of the student embedder $E_s$ and the parameters $\theta_k$ of the parametric Gaussian kernels. For $t = 1$ to $T$ do: sample a batch of inputs $\{x_i\}$; compute the embeddings $t_i^k = T_k(x_i)$ for $1 \le k \le K$; compute the student embeddings $\{s_i = S(x_i)\}$; compute the loss $L_{NLL} = -\sum_{k=1}^{K} \sum_{i=1}^{N} \log \mathcal{N}(t_i^k \mid \mu_k(s_i), \Sigma_k(s_i))$; update the parameters $\theta_s$ and $\theta_k$ using the Adam optimizer. End for. |
| Open Source Code | Yes | https://github.com/ills-montreal/nlp-distill, https://github.com/ills-montreal/mol-distill, https://github.com/ills-montreal/vision-distill/ |
| Open Datasets | Yes | We gathered different common datasets used for training embedders and collected 6 million entries from the Huggingface Hub, including Specter (Cohan et al., 2020), T5 (Ni et al., 2021), Amazon QA (McAuley & Leskovec, 2013), IMDB (Maas et al., 2011), SNLI (Bowman et al., 2015), QQP triplets from Quora, AG News (Zhang et al., 2015), MEDI dataset (Su et al., 2023) and the DAIL Emotion dataset (Saravia et al., 2018). ... We evaluated all models on the ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) tasks of the Therapeutic Data Commons platform (TDC) (Huang et al., 2021) and on a high-throughput screening task (HTS), (HIV (Wu et al., 2018))... Both are public datasets of commercially available compounds designed to be used in various therapeutic projects. |
| Dataset Splits | Yes | For every task, we opted for a random split since we obtained similar results to a scaffold split, with a faster computation time, with a ratio of 70/10/20 for the train/validation/test sets. For the augmentation we used color jitter with brightness, contrast, saturation and hue equal to 0.2, and random horizontal flip (except for the SVHN dataset). ... If there is no official validation set, we split the official training part into train and validation sets with 80 and 20 percent of the data, respectively. |
| Hardware Specification | Yes | Models are trained for two epochs with batch size 16 on NVIDIA V100. Training our molecular embedders on the largest dataset (2 M molecules) takes approximately 50 hours on 6 A6000 GPUs. Our experiments were conducted in single-GPU settings. We used NVIDIA V100 GPUs for about 3000 GPU hours to train our different models. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as Python, PyTorch, or other libraries. It mentions frameworks and optimizers (e.g., Huggingface Hub, Adam optimizer) but without version details for the underlying software stack. |
| Experiment Setup | Yes | Models are trained for two epochs with batch size 16 on NVIDIA V100. We trained our models using the Adam optimizer with a constant learning rate of $5 \cdot 10^{-5}$ and an effective batch size of 16 for all our models. We use a batch size of 256 and a learning rate of 1e-4 to train the model for 400 epochs on the 250k dataset and 200 epochs on the 2M dataset. For the optimizer we use Adam, with learning rate of 0.001, a batch size of 128, trained for 50 epochs. Our search space includes the learning rate with values (1e-2, 1e-3), the number of fully connected layer units with values (0, 128), and the type of normalization after the fully connected layer, considering (no optimization, batch normalization, layer normalization). The models are trained for a maximum of 1000 epochs with a batch size of 128, but we apply early stopping with a patience of 20 to prevent over-fitting and reduce unnecessary computation. |
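
The loss in the Pseudocode row sums, over every teacher and every sample, the negative log-density of the teacher embedding under a Gaussian whose mean and covariance are predicted from the student embedding. The sketch below illustrates only that quantity in plain Python. It assumes a diagonal covariance (the quoted algorithm does not say how the covariance is parameterized), and the names `gaussian_nll` and `multi_teacher_nll` are hypothetical, not taken from the released repositories. In practice the means and variances would come from trainable kernels applied to the student embeddings; here they are passed in as plain lists.

```python
import math


def gaussian_nll(target, mean, var):
    """Negative log-density of `target` under N(mean, diag(var)).

    Assumption: diagonal covariance, given as a list of per-dimension variances.
    """
    nll = 0.0
    for t, m, v in zip(target, mean, var):
        nll += 0.5 * (math.log(2.0 * math.pi * v) + (t - m) ** 2 / v)
    return nll


def multi_teacher_nll(teacher_embs, pred_means, pred_vars):
    """Sum the Gaussian NLL over K teachers and N samples.

    teacher_embs[k][i] : embedding of sample i produced by teacher k
    pred_means[k][i]   : mean predicted for teacher k from the student embedding of sample i
    pred_vars[k][i]    : per-dimension variances predicted the same way
    """
    total = 0.0
    for t_k, mu_k, var_k in zip(teacher_embs, pred_means, pred_vars):
        for t, mu, var in zip(t_k, mu_k, var_k):
            total += gaussian_nll(t, mu, var)
    return total


# Toy example: K = 2 teachers, N = 1 sample, 1-dimensional embeddings.
teachers = [[[0.0]], [[1.0]]]
good_means = [[[0.0]], [[1.0]]]   # predictions that match the teachers
bad_means = [[[2.0]], [[-1.0]]]   # predictions that miss the teachers
unit_vars = [[[1.0]], [[1.0]]]

loss_good = multi_teacher_nll(teachers, good_means, unit_vars)
loss_bad = multi_teacher_nll(teachers, bad_means, unit_vars)
```

With unit variances, predictions that match the teacher embeddings give the smaller loss, which is the signal the student and kernel parameters would be trained on.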