Humanly Certifying Superhuman Classifiers

Authors: Qiongkai Xu, Christian Walder, Chenchen Xu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks..."
Researcher Affiliation | Collaboration | Qiongkai Xu, University of Melbourne, Victoria, Australia (qiongkai.xu@unimelb.edu.au); Christian Walder, Google Brain, Montreal, Canada (cwalder@google.com); Chenchen Xu, Amazon, Canberra, Australia (xuchench@amazon.com)
Pseudocode | Yes | Algorithm 1 (Heuristic Margin Separation, HMS); Algorithm 2 (Optimal Margin Separation, OMS).
Open Source Code | Yes | "Our code is available at https://github.com/xuqiongkai/Superhuman-Eval.git."
Open Datasets | Yes | "We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013) for sentiment classification... We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) for NLI." (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions training on "randomly generated examples" for the toy tasks and evaluating on "test sets" for the real-world tasks (SST, SNLI), but it gives no counts or percentages for the train, validation, and test splits, nor does it explicitly cite standard splits.
Hardware Specification | No | The paper mentions training YOLOv3 models but does not specify the hardware used (e.g., GPU or CPU models, memory).
Software Dependencies | No | The paper refers to models and architectures such as YOLOv3, Darknet-53, CNN-LSTM, Bi-LSTM, Tree-LSTM, Tree-CNN, BERT-large, LM-Pretrained Transformer, RoBERTa+Self-Explaining, StructBERT, and SemBERT, but it does not pin any software dependencies to version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "The input image resolution is 608 × 608, and we use the proposed Darknet-53 as the backbone feature extractor... All models are trained for a maximum of 200 epochs until convergence." (A configuration sketch follows the table.)
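
The "Open Datasets" row names two public corpora. As a minimal sketch of how one might fetch them, assuming the Hugging Face hub identifiers "sst" and "snli" (an assumption on our part; the authors' repository may ship its own loaders):

    # Illustrative sketch only: fetch the two public datasets named in the paper.
    # The hub identifiers "sst" and "snli" are assumptions, not taken from the paper.
    from datasets import load_dataset

    sst = load_dataset("sst", "default")  # Stanford Sentiment Treebank (Socher et al., 2013)
    snli = load_dataset("snli")           # Stanford NLI corpus (Bowman et al., 2015)

    # Print the split sizes -- the very information the "Dataset Splits" row
    # flags as unreported in the paper.
    for name, ds in [("SST", sst), ("SNLI", snli)]:
        print(name, {split: len(ds[split]) for split in ds})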
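
The "Experiment Setup" row quotes three concrete training facts: a 608 × 608 input resolution, a Darknet-53 backbone, and a 200-epoch training cap. A minimal sketch collecting them into one configuration object, with hypothetical structure and field names (only the three values are stated in the paper):

    # Minimal sketch of the stated YOLOv3 training setup. The dataclass and its
    # field names are hypothetical; only the three values come from the paper.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DetectionTrainConfig:
        input_resolution: tuple = (608, 608)  # stated input image resolution
        backbone: str = "Darknet-53"          # stated backbone feature extractor
        max_epochs: int = 200                 # "maximum of 200 epochs until convergence"

    print(DetectionTrainConfig())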