Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Private and Non-private Uniformity Testing for Ranking Data
Authors: Róbert Busa-Fekete, Dimitris Fotakis, Emmanouil Zampetakis
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We carry out large-scale experiments, including m = 10,000, to show that our uniformity testing algorithms scale gracefully with m." and "7 Experiments: We shall present synthetic experiments to assess the performance of the proposed tests." |
| Researcher Affiliation | Collaboration | Róbert Busa-Fekete Google Research, New York, USA EMAIL Dimitris Fotakis National Technical University of Athens, Greece EMAIL Manolis Zampetakis University of California, Berkeley, USA EMAIL |
| Pseudocode | Yes | Algorithm 1 2SAMP: Uniformity Test with Two Samples, Algorithm 2 Uniformity Test (UNIF), Algorithm 3 Central DP Uniformity Test (TRUN), Algorithm 5, Algorithm 6 |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | No | "We shall present synthetic experiments to assess the performance of the proposed tests." and "We used synthetic data." No information is provided about public access to the synthetic data itself or about the exact generation process needed to reproduce it as a dataset. |
| Dataset Splits | No | The paper discusses sample complexity for statistical tests and uses synthetic data, but does not mention any training, validation, or test dataset splits. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] We used data centers to compute the experiments. I believe that it is not so relevant to this work how long the computation did take. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers for the key software components used in the experiments. |
| Experiment Setup | Yes | Every testing algorithm we presented has a tolerance parameter and a significance level δ. We used δ = 0.05 in every case. The tolerance parameter affects only the sample size of the testing algorithms. Instead of fixing it to a particular value, we plotted the power of the algorithms at various sample sizes, so that the testing algorithms could be compared on the same number of input samples. Each result reported here is computed from 1000 repetitions. The central ranking of the model from which the random samples are generated is selected uniformly at random in each run, independently. |
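The experimental protocol in the table (significance δ = 0.05, power estimated over 1000 repetitions at a fixed sample size, with rankings drawn either uniformly or from a biased model) can be illustrated with a minimal sketch. This is *not* the paper's 2SAMP/UNIF/TRUN algorithms: as a hypothetical stand-in test, it applies a chi-squared goodness-of-fit check to the marginal distribution of the top-ranked item, which is sufficient to show how the rejection rate is estimated empirically.

```python
import random

M = 5                      # number of items; rankings are permutations of range(M)
DELTA = 0.05               # significance level, as in the paper's experiments
CHI2_CRIT = 9.4877         # chi-squared 0.95 quantile, M - 1 = 4 degrees of freedom

def first_position_test(rankings):
    """Reject uniformity if the marginal of the top-ranked item deviates
    too far from uniform (chi-squared goodness-of-fit over M cells)."""
    n = len(rankings)
    counts = [0] * M
    for r in rankings:
        counts[r[0]] += 1
    expected = n / M
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat > CHI2_CRIT          # True = reject the null of uniformity

def uniform_ranking():
    r = list(range(M))
    random.shuffle(r)
    return r

def biased_ranking():
    # Toy alternative model: item 0 is forced to the top with probability 1/2.
    if random.random() < 0.5:
        rest = list(range(1, M))
        random.shuffle(rest)
        return [0] + rest
    return uniform_ranking()

def rejection_rate(sampler, n_samples=200, reps=1000):
    """Fraction of repetitions in which the test rejects, mirroring the
    paper's protocol of estimating power from 1000 repetitions."""
    rejections = sum(
        first_position_test([sampler() for _ in range(n_samples)])
        for _ in range(reps)
    )
    return rejections / reps

random.seed(0)
print(f"type-I error (uniform data): {rejection_rate(uniform_ranking):.3f}")
print(f"power (biased data):         {rejection_rate(biased_ranking):.3f}")
```

Under uniform data the rejection rate should stay near δ = 0.05 (the type-I error), while under the biased model it should approach 1, which is exactly the power-versus-sample-size comparison the paper plots.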