Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Probing Classifiers are Unreliable for Concept Removal and Detection
Authors: Abhinav Kumar, Chenhao Tan, Amit Sharma
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on four datasets (natural language inference, sentiment analysis, tweet-mention detection, and a synthetic task) confirm our claims. |
| Researcher Affiliation | Collaboration | Abhinav Kumar (Microsoft Research), Chenhao Tan (University of Chicago), Amit Sharma (Microsoft Research) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states in its checklist that code is included (See E) but Appendix E does not provide a specific URL or explicit statement about the release of source code for the paper's methodology. |
| Open Datasets | Yes | We use three datasets: MultiNLI [46], Twitter-PAN16 [31] and Twitter-AAE [6]. |
| Dataset Splits | Yes | For MultiNLI, we use standard validation/test splits provided in the dataset. |
| Hardware Specification | No | The paper states 'All experiments run on a single NVIDIA GPU' but does not provide specific hardware details such as the GPU model, CPU type, or memory specifications. |
| Software Dependencies | No | The paper mentions using RoBERTa, GloVe embeddings, and the AdamW optimizer, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train for 20 epochs for all datasets, except for Synthetic-Text and MultiNLI, for which we train for 40 epochs. We use AdamW optimizer with a learning rate of 1e-5. We use a batch size of 32 for MultiNLI, 16 for Twitter-PAN16, 8 for Twitter-AAE, and 64 for Synthetic-Text. |
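
For anyone attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into one lookup table. The following is a minimal sketch, not the authors' code: the `TrainConfig` class, its field names, and the `CONFIGS` dictionary are hypothetical, and only the numeric values, the optimizer name, and the dataset names come from the row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings quoted in the table."""
    epochs: int
    batch_size: int
    learning_rate: float = 1e-5   # same learning rate reported for all datasets
    optimizer: str = "AdamW"      # optimizer named in the quoted setup


# Values transcribed from the Experiment Setup row; the keys are
# illustrative labels, not identifiers from any released code.
CONFIGS = {
    "MultiNLI": TrainConfig(epochs=40, batch_size=32),
    "Twitter-PAN16": TrainConfig(epochs=20, batch_size=16),
    "Twitter-AAE": TrainConfig(epochs=20, batch_size=8),
    "Synthetic-Text": TrainConfig(epochs=40, batch_size=64),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name}: {cfg.epochs} epochs, batch size {cfg.batch_size}, "
              f"{cfg.optimizer} @ lr={cfg.learning_rate}")
```

Settings the table marks as unreported (GPU model, library versions) are deliberately left out rather than guessed.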