Humanly Certifying Superhuman Classifiers
Authors: Qiongkai Xu, Christian Walder, Chenchen Xu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks. |
| Researcher Affiliation | Collaboration | Qiongkai Xu University of Melbourne Victoria, Australia qiongkai.xu@unimelb.edu.au Christian Walder Google Brain Montreal, Canada cwalder@google.com Chenchen Xu Amazon Canberra, Australia xuchench@amazon.com |
| Pseudocode | Yes | Algorithm 1 (Heuristic Margin Separation, HMS). Algorithm 2 (Optimal Margin Separation, OMS). |
| Open Source Code | Yes | Our code is available at https://github.com/xuqiongkai/Superhuman-Eval.git. |
| Open Datasets | Yes | We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013) for sentiment classification... We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) for NLI. |
| Dataset Splits | No | The paper mentions training on 'randomly generated examples' for toy tasks and evaluating on 'test sets' for real-world tasks (SST, SNLI), but it does not provide specific percentages or counts for training, validation, and test splits used in their experiments, nor does it explicitly cite standard splits with specific details. |
| Hardware Specification | No | The paper mentions training YOLOv3 models but does not provide any specific details about the hardware (e.g., GPU, CPU models, or memory) used for these experiments. |
| Software Dependencies | No | The paper refers to models and architectures like YOLOv3, Darknet-53, CNN-LSTM, Bi-LSTM, Tree-LSTM, Tree-CNN, BERT-large, LM-Pretrained Transformer, RoBERTa+Self-Explaining, StructBERT, and SemBERT, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | The input image resolution is 608 × 608, and we use the proposed Darknet-53 as the backbone feature extractor... All models are trained for a maximum of 200 epochs until convergence. |
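The bounds the paper validates are driven by pairwise agreement rates between evaluators (classifier and human annotators), a quantity that can be computed without oracle labels. A minimal sketch of that computation, using hypothetical labels and a function name of our own choosing (not the authors' implementation):

```python
import numpy as np

def pairwise_agreement(labels_a, labels_b):
    """Fraction of examples on which two evaluators assign the same label."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    return float(np.mean(labels_a == labels_b))

# Hypothetical predictions from a classifier and two human annotators
classifier = [1, 0, 1, 1, 0, 1]
human_1 = [1, 0, 1, 0, 0, 1]
human_2 = [1, 1, 1, 0, 0, 1]

print(pairwise_agreement(classifier, human_1))  # agree on 5 of 6 examples
print(pairwise_agreement(human_1, human_2))
```

In the paper's setting, such agreement statistics from toy tasks with known oracles (and from the SST and SNLI leaderboard meta-analyses) are what feed the accuracy bounds.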