Quantifying and Reducing Bias in Maximum Likelihood Estimation of Structured Anomalies
Authors: Uthsav Chitra, Kimberly Ding, Jasper C.H. Lee, Benjamin J. Raphael
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we demonstrate that in the normal means setting, the bias of the MLE depends on the size of the anomaly family. We prove that if the number of sets in the anomaly family that contain the anomaly is sub-exponential, then the MLE is asymptotically unbiased. We also provide empirical evidence that the converse is true: if the number of such sets is exponential, then the MLE is asymptotically biased. Our analysis unifies a number of earlier results on the bias of the MLE for specific anomaly families. Next, we derive a new anomaly estimator using a mixture model, and we prove that our anomaly estimator is asymptotically unbiased regardless of the size of the anomaly family. We illustrate the advantages of our estimator versus the MLE on disease outbreak data and highway traffic data. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Princeton University, Princeton, New Jersey, USA; ²Department of Computer Science, Brown University, Providence, Rhode Island, USA. Correspondence to: Benjamin J. Raphael <braphael@princeton.edu>. |
| Pseudocode | No | The paper describes the steps of its GMM-based anomaly estimator in paragraph text (e.g., 'Given data X ~ ASDS(A, µ), we first use the EM algorithm to fit a GMM to the data X.'), but it does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Next, we simulate a disease outbreak on the Northeastern USA Benchmark (NEast) graph, a standard benchmark for estimating spatial anomalies (Cadena et al., 2018a; 2019). ... We also compare our estimator $\hat{A}_{\text{GMM}}$ and the MLE $\hat{A}_{\mathcal{S}}$ on a real-world highway traffic dataset; similar to the NEast graph, this dataset is also often studied in the scan statistic literature (Zhou & Chen, 2016; Cadena et al., 2018a; 2019). ... We also compared our estimator and the MLE on a dataset of breast cancer incidence in census blocks in Manhattan (Boscoe et al., 2016) using the connected family $\mathcal{C}_G$. |
| Dataset Splits | No | The paper describes how samples were generated or obtained for experiments (e.g., 'We draw a sample $X \sim \text{ASDS}(A, \mu)$ with $n = 900$ observations, and compute the MLE $\hat{A}_{\mathcal{S}}$. We repeat for 50 samples to estimate $\text{Bias}(\lvert \hat{A}_{\mathcal{S}} \rvert / n)$.'), but it does not specify explicit train/validation/test dataset splits needed for reproduction in the context of model evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions methods like 'EM algorithm' and 'convex program' but does not specify any software packages, libraries, or their version numbers that were used in the implementation. |
| Experiment Setup | Yes | For each anomaly family $\mathcal{S}$, we select an anomaly $A \in \mathcal{S}$ with size $\lvert A \rvert = 0.05n$ uniformly at random from $\mathcal{S}$. We draw a sample $X = (X_1, \dots, X_n) \sim \text{ASDS}(A, \mu)$ with $n = 900$ observations, and compute the MLE $\hat{A}_{\mathcal{S}}$. We repeat for 50 samples to estimate $\text{Bias}(\lvert \hat{A}_{\mathcal{S}} \rvert / n)$. We perform this process for a range of means $\mu \ge \mu_{\text{detect}}$. ... For the interval family $\mathcal{I}_n$ and submatrix family $\mathcal{M}_N$, where $\lvert \mathcal{S}(A) \rvert$ is sub-exponential, we find that $\text{Bias}(\lvert \hat{A}_{\mathcal{S}} \rvert / n) \approx 0$ for all means $\mu \ge \mu_{\text{detect}}$ (Figure 2A). ... with an Erdős-Rényi random graph $G$ (edge probability = 0.01). |
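
Since the paper releases no code, the following is a minimal sketch of the size-bias experiment quoted in the Experiment Setup row, restricted to the interval family. It is not the authors' implementation. It assumes the normal means model (observations inside the anomaly are N(µ, 1), all others N(0, 1)) and the standard scan-statistic form of the MLE, which maximizes the sum over an interval divided by the square root of its length. The function names, the default `mu=3.0`, and the seed are hypothetical choices for illustration; the quoted text does not give a value for µdetect.

```python
import numpy as np


def sample_asds(n, anomaly, mu, rng):
    """Draw X with X_i ~ N(mu, 1) for i in the anomaly, N(0, 1) otherwise (assumed model)."""
    x = rng.standard_normal(n)
    x[anomaly] += mu
    return x


def interval_mle(x):
    """Return (start, stop) maximizing sum(x[start:stop]) / sqrt(stop - start).

    Assumption: this normalized-sum scan statistic is used as the MLE
    over the family of all contiguous intervals.
    """
    n = len(x)
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    # Enumerate every pair start < stop, i.e. every non-empty interval.
    starts, stops = np.triu_indices(n + 1, k=1)
    scores = (prefix[stops] - prefix[starts]) / np.sqrt(stops - starts)
    best = int(np.argmax(scores))
    return int(starts[best]), int(stops[best])


def estimate_size_bias(n=900, frac=0.05, mu=3.0, reps=50, seed=0):
    """Monte Carlo estimate of Bias(|A_hat| / n) for the interval family."""
    rng = np.random.default_rng(seed)
    size = int(round(frac * n))
    # Pick the true anomaly uniformly among intervals of the requested size.
    start = int(rng.integers(0, n - size + 1))
    anomaly = np.arange(start, start + size)
    estimated_sizes = []
    for _ in range(reps):
        x = sample_asds(n, anomaly, mu, rng)
        lo, hi = interval_mle(x)
        estimated_sizes.append(hi - lo)
    # Bias of the normalized size: E[|A_hat|] / n - |A| / n.
    return float(np.mean(estimated_sizes)) / n - size / n


if __name__ == "__main__":
    # Sweep a few hypothetical means; the paper's actual grid is not quoted here.
    for mu in (2.0, 3.0, 4.0):
        print(mu, estimate_size_bias(mu=mu))
```

Enumerating all intervals through prefix sums keeps each MLE computation to one vectorized pass over roughly n²/2 candidates, which is manageable at n = 900. Other anomaly families mentioned in the table, such as submatrices or connected subgraphs, would need their own search routine in place of `interval_mle`.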