How Flawed Is ECE? An Analysis via Logit Smoothing
Authors: Muthu Chidambaram, Holden Lee, Colin McSwiggen, Semon Rezchikov
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice. Lastly, in Section 6, we verify empirically that LS-ECE is continuous even when ECE is not, and also show that for the standard image classification benchmarks of CIFAR-10, CIFAR-100, and ImageNet, both ECE and LS-ECE produce near-identical results across various models, indicating that the theoretical pathologies of ECE may not pose an issue in practice. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Duke University 2Johns Hopkins University 3New York University 4Princeton University. |
| Pseudocode | Yes | Figure 1. Implementation of the estimator $\widehat{\mathrm{LS\text{-}ECE}}_{\pi,\xi}(h)$ in 10 lines of PyTorch (Paszke et al., 2019) using broadcast semantics. |
| Open Source Code | Yes | All of the code used to generate the plots in this section can be found at: https://github.com/2014mchidamb/how-flawed-is-ece. |
| Open Datasets | Yes | Lastly, in Section 6, we verify empirically that LS-ECE is continuous even when ECE is not, and also show that for the standard image classification benchmarks of CIFAR-10, CIFAR-100, and ImageNet, both ECE and LS-ECE produce near-identical results across various models, indicating that the theoretical pathologies of ECE may not pose an issue in practice. |
| Dataset Splits | No | The paper mentions using 'ImageNet-1K validation data' but does not specify the exact percentages or absolute sample counts for the splits, nor does it cite a specific predefined split with authors and year. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' as a tool for implementation in Figure 1 but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | We let σ be the inverse of the number of bins used for $\mathrm{ECE}_{\mathrm{BIN},\pi}(g)$. ... We estimate $\widehat{\mathrm{LS\text{-}ECE}}_{\pi,\xi}(h)$ via 10000 independent samples drawn from the distribution of h(X)+ξ... We consider using uniform noise in Appendix D... For our CIFAR experiments, we use pretrained versions (due to Yaofo Chen)... (a hedged sketch of this estimation procedure follows the table) |
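The setup row above describes estimating the smoothed calibration error by adding Gaussian noise ξ (with standard deviation σ equal to the inverse of the ECE bin count) to the model confidences and drawing 10000 samples from the resulting distribution. The authors' actual 10-line PyTorch implementation is in the linked repository; the snippet below is only a minimal sketch of how such an estimator could look, assuming a Gaussian-kernel-regression form of the smoothing. The function name `ls_ece_sketch` and its arguments are hypothetical and are not taken from the paper's code.

```python
import torch

def ls_ece_sketch(confidences, correct, sigma, num_samples=10000):
    """Hypothetical LS-ECE-style estimator (not the authors' released code).

    confidences: (n,) tensor of model confidences h(x_i) in [0, 1]
    correct:     (n,) tensor of 0/1 indicators that the top prediction was right
    sigma:       std of the Gaussian smoothing noise xi (the paper sets this
                 to the inverse of the number of ECE bins)
    """
    correct = correct.float()
    n = confidences.shape[0]

    # Draw samples from the distribution of h(X) + xi by resampling the
    # empirical confidences and adding Gaussian noise.
    idx = torch.randint(0, n, (num_samples,))
    t = confidences[idx] + sigma * torch.randn(num_samples)

    # Gaussian-kernel regression estimate of E[correct | h(X) + xi = t],
    # evaluated at each sampled t via an (n x num_samples) broadcast.
    diff = (t.unsqueeze(0) - confidences.unsqueeze(1)) / sigma
    weights = torch.exp(-0.5 * diff ** 2)   # unnormalized kernel weights
    weights = weights / weights.sum(dim=0)  # normalize over the n data points
    smoothed_acc = (weights * correct.unsqueeze(1)).sum(dim=0)

    # Mean absolute gap between smoothed accuracy and the smoothed confidence.
    return (smoothed_acc - t).abs().mean()

# Example: labels drawn as Bernoulli(confidence) are perfectly calibrated,
# so the estimate should be close to zero.
conf = torch.rand(5000)
corr = (torch.rand(5000) < conf).float()
print(ls_ece_sketch(conf, corr, sigma=1 / 15))
```

Comparing the output of such an estimator against a standard binned ECE on the same confidences is the kind of check the paper reports for CIFAR-10, CIFAR-100, and ImageNet, where the two quantities were found to be nearly identical.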