Humanly Certifying Superhuman Classifiers

Authors: Qiongkai Xu, Christian Walder, Chenchen Xu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks..."
Researcher Affiliation | Collaboration | Qiongkai Xu, University of Melbourne, Victoria, Australia (qiongkai.xu@unimelb.edu.au); Christian Walder, Google Brain, Montreal, Canada (cwalder@google.com); Chenchen Xu, Amazon, Canberra, Australia (xuchench@amazon.com)
Pseudocode | Yes | Algorithm 1 (Heuristic Margin Separation, HMS); Algorithm 2 (Optimal Margin Separation, OMS).
Open Source Code | Yes | "Our code is available at https://github.com/xuqiongkai/Superhuman-Eval.git."
Open Datasets | Yes | "We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013) for sentiment classification... We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) for NLI." (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions training on "randomly generated examples" for the toy tasks and evaluating on "test sets" for the real-world tasks (SST, SNLI), but it gives no counts or percentages for the train, validation, and test splits, nor does it explicitly cite standard splits.
Hardware Specification | No | The paper mentions training YOLOv3 models but does not specify the hardware used (e.g., GPU or CPU models, memory).
Software Dependencies | No | The paper refers to models and architectures such as YOLOv3, Darknet-53, CNN-LSTM, Bi-LSTM, Tree-LSTM, Tree-CNN, BERT-large, LM-Pretrained Transformer, RoBERTa+Self-Explaining, StructBERT, and SemBERT, but it does not pin any software dependencies to version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "The input image resolution is 608 × 608, and we use the proposed Darknet-53 as the backbone feature extractor... All models are trained for a maximum of 200 epochs until convergence." (A configuration sketch follows the table.)
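
The "Open Datasets" row names two public corpora. As a minimal sketch of how one might fetch them, assuming the Hugging Face hub identifiers "sst" and "snli" (an assumption on our part; the authors' repository may ship its own loaders):

    # Illustrative sketch only: fetch the two public datasets named in the paper.
    # The hub identifiers "sst" and "snli" are assumptions, not taken from the paper.
    from datasets import load_dataset

    sst = load_dataset("sst", "default")  # Stanford Sentiment Treebank (Socher et al., 2013)
    snli = load_dataset("snli")           # Stanford NLI corpus (Bowman et al., 2015)

    # Print the split sizes -- the very information the "Dataset Splits" row
    # flags as unreported in the paper.
    for name, ds in [("SST", sst), ("SNLI", snli)]:
        print(name, {split: len(ds[split]) for split in ds})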
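
The "Experiment Setup" row quotes three concrete training facts: a 608 × 608 input resolution, a Darknet-53 backbone, and a 200-epoch training cap. A minimal sketch collecting them into one configuration object, with hypothetical structure and field names (only the three values are stated in the paper):

    # Minimal sketch of the stated YOLOv3 training setup. The dataclass and its
    # field names are hypothetical; only the three values come from the paper.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DetectionTrainConfig:
        input_resolution: tuple = (608, 608)  # stated input image resolution
        backbone: str = "Darknet-53"          # stated backbone feature extractor
        max_epochs: int = 200                 # "maximum of 200 epochs until convergence"

    print(DetectionTrainConfig())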