Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation

Authors: Trenton Chang, Lindsay Warrenburg, Sae-Hwan Park, Ravi Parikh, Maggie Makar, Jenna Wiens

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical results in a synthetic data study validating the usage of causal effect estimation for gaming detection and show in a case study of diagnosis coding behavior in the U.S. that our approach highlights features associated with gaming. (Section 5: Empirical results & discussion)
Researcher Affiliation | Academia | (1) University of Michigan, (2) University of Pennsylvania, (3) Emory University
Pseudocode | Yes | Figure 4: Pseudocode for causally-motivated gaming detection.
Open Source Code | Yes | Code to replicate our experiments will be made publicly available at https://github.com/MLD3/gaming_detection.
Open Datasets | Yes | Full synthetic data generation details are in Appendix C.1. Our cohort is drawn from a 20% sample of all U.S. Medicare beneficiaries provided to the authors under a data usage agreement with the Center for Medicare & Medicaid Services.
Dataset Splits | Yes | We perform a 7:3 dataset train-test split, training all models on the larger split. All rankings are computed on the test split. Early stopping is performed on a 20% validation split randomly sampled from the training set.
Hardware Specification | Yes | All experiments were run on either one Titan V or V100 GPU using 12.9GB of RAM as managed via a Slurm job submission system. Computing nodes had two 2.10GHz Intel Broadwell (Xeon E5-2620V4) processors each (16 cores total).
Software Dependencies | Yes | All code was written in Python 3.10.4 (license: PSF). All non-causal anomaly detection approaches were implemented using PyOD (license: BSD 2-clause) [56]. All neural networks were implemented in PyTorch 2.2.0 (license: Custom BSD-style) [57], using Skorch 0.15.0 (license: BSD 3-clause) [58] as a wrapper. Metrics were computed using both Scikit-Learn 1.3.2 (license: BSD 3-clause) [59] and SciPy 1.11.4 (license: BSD 3-clause) [60]. For the fully synthetic data generation process, CVXPY 1.4.2 (license: Apache 2.0) [61] was used to solve each agent's utility maximization problem, and used in tandem with SCIP 9.0 (pyscipopt 5.0.0; license: Apache 2.0) for the matching approaches (formulated as mixed-integer programs) [62]. NumPy 1.22.3 (license: BSD-style) [63] and Pandas 2.0.3 (license: BSD 3-clause) [64] were used for data manipulation. Matplotlib 3.8.2 (empirical results; license: PSF-style) and Adobe Illustrator 2023 (overview figures; license: commercial, Named User Licensing) were used for figure generation. For the Medicare cohorts, we generated HCC (Hierarchical Condition Categories; used by the Center for Medicare Services) codes from raw diagnosis codes reported in claims data via HCCPy 0.1.9 (license: Apache 2.0).
Experiment Setup | Yes | Optimizer: SGD with learning rate 10^-2 and weight decay 10^-3. Learning rate schedule: We reduce the learning rate by a factor of 0.1 after 5 epochs of non-improvement with respect to the validation loss. Training length: A maximum of 1000 epochs, with early stopping (patience: 10 epochs) based on validation loss.
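
The Dataset Splits entry describes a 7:3 train-test split with a further 20% of the training portion held out for validation. The following is a minimal, framework-free sketch of that index arithmetic, not the authors' code. The function name `split_indices`, the fixed seed, and the choice to remove the validation indices from the training set are all assumptions made for illustration; the source does not say how the split was implemented.

```python
import random


def split_indices(n_samples, test_frac=0.3, val_frac=0.2, seed=0):
    """Return (train, val, test) index lists.

    Hypothetical helper: 30% of samples go to the test split, then 20% of
    the remaining training portion is set aside for validation.
    """
    rng = random.Random(seed)          # local RNG so the split is repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = round(n_samples * test_frac)
    test = indices[:n_test]
    train_full = indices[n_test:]      # the "larger split" (70%)

    n_val = round(len(train_full) * val_frac)
    val = train_full[:n_val]           # used only for early stopping
    train = train_full[n_val:]         # assumption: validation is held out
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 560 140 300
```

With 1000 samples this yields 560 training, 140 validation, and 300 test indices, with no index shared between the three lists.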
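
The Experiment Setup entry combines a plateau-based learning rate schedule with early stopping. The sketch below replays a list of validation losses to show how the two patience counters might interact; it does not model the SGD optimizer or weight decay, and it is not the authors' implementation. The function name `simulate_training` is hypothetical, and the exact trigger rule (reduce at the 5th consecutive non-improving epoch, stop at the 10th) is one reading of the description; library schedulers such as PyTorch's `ReduceLROnPlateau` may count epochs slightly differently.

```python
def simulate_training(val_losses, lr=1e-2, factor=0.1, lr_patience=5,
                      stop_patience=10, max_epochs=1000):
    """Replay per-epoch validation losses; return (stop_epoch, final_lr).

    Hypothetical helper illustrating the reported settings:
    initial lr 1e-2, reduce by 0.1 after 5 non-improving epochs,
    stop after 10 non-improving epochs or 1000 epochs.
    """
    best = float("inf")
    since_best = 0       # consecutive non-improving epochs (early stopping)
    since_lr_event = 0   # non-improving epochs since last improvement or LR cut

    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_lr_event = 0
        else:
            since_best += 1
            since_lr_event += 1

        if since_best >= stop_patience:
            return epoch, lr             # early stop: no further LR change
        if since_lr_event >= lr_patience:
            lr *= factor                 # plateau reached: shrink the step size
            since_lr_event = 0

    return min(len(val_losses), max_epochs), lr


# Three improving epochs followed by a long plateau.
losses = [1.0, 0.9, 0.8] + [0.85] * 20
stop_epoch, final_lr = simulate_training(losses)
print(stop_epoch)  # 13
```

In this replay the learning rate is cut once, at epoch 8, and training stops at epoch 13, ten epochs after the last improvement at epoch 3.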