Efficiently Controlling Multiple Risks with Pareto Testing
Authors: Bracha Laufer-Goldshtein, Adam Fisch, Regina Barzilay, Tommi S. Jaakkola
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. ... (Section 7, Experiments) Experimental setup. We test our method over five text classification tasks of varied difficulty levels: IMDB (Maas et al., 2011), AG News (Zhang et al., 2015), QNLI (Rajpurkar et al., 2016), QQP, MNLI (Williams et al., 2018). |
| Researcher Affiliation | Academia | Bracha Laufer-Goldshtein, Adam Fisch, Regina Barzilay & Tommi Jaakkola CSAIL, MIT, {lauferb,fisch,regina,tommi}@csail.mit.edu |
| Pseudocode | Yes | Algorithm 1 Pareto Testing Definitions: f is a configurable model with n thresholds λ = (λ1, . . . , λn). ... Algorithm F.1 Recover Pareto Optimal Set Definitions: ... Algorithm F.2 Learn then Test (Single Objective) Definitions: ... Algorithm F.3 3D Graph Testing Definitions: ... Algorithm F.4 Shortest-Path Testing Definitions: ... |
| Open Source Code | Yes | Code. Our code will be made available at https://github.com/bracha-laufer/pareto-testing. |
| Open Datasets | Yes | We test our method over five text classification tasks of varied difficulty levels: IMDB (Maas et al., 2011), AG News (Zhang et al., 2015), QNLI (Rajpurkar et al., 2016), QQP, MNLI (Williams et al., 2018). |
| Dataset Splits | Yes | Table C.1 (Dataset Details): IMDB: sentiment analysis on movie reviews; \|Y\| = 2; Train 20K; Val. 5K; Test 10K; Cal. (out of Test) 5K; full-model accuracy 94%. ... Algorithm 1 Pareto Testing Definitions: ... Dcal = Dopt ∪ Dtesting is a calibration set of size m, split into optimization and (statistical) testing sets of sizes m1 and m2, respectively. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models (e.g., NVIDIA A100, Tesla V100) or CPU specifications (e.g., Intel Core i7). |
| Software Dependencies | No | The paper mentions using a 'BERT-base model' and discusses deep learning concepts, but it does not specify software dependencies with version numbers (e.g., 'PyTorch 1.x', 'Python 3.x'). |
| Experiment Setup | Yes | Experimental setup. We test our method over five text classification tasks... We use a BERT-base model (Devlin et al., 2018) with K = 12 layers and W = 12 heads per layer... Prediction Heads. Each prediction head is a 2-layer feed-forward neural network with 32-dimensional hidden states and ReLU activation... Token importance predictors. Each token importance predictor is a 2-layer feed-forward neural network with 32-dimensional hidden states and ReLU activation. ... Training. The core model is first finetuned on each task. |
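The Pseudocode and Dataset Splits rows above describe the paper's core recipe: the calibration set is split into an optimization part (used to order candidate configurations along the Pareto front) and a statistical testing part (used for fixed-sequence hypothesis testing). The sketch below illustrates that second stage only. It is a minimal illustration, not the authors' implementation: the Hoeffding-style p-value, the `ordered_configs` / `loss_fn` names, and the single-risk setting are all simplifying assumptions made here for clarity (the paper handles multiple risks and uses the Hoeffding-Bentkus bound).

```python
import math

def hoeffding_pvalue(losses, alpha):
    """One-sided p-value for H0: E[loss] > alpha, assuming losses in [0, 1].

    Uses Hoeffding's inequality: P(mean <= r_hat) <= exp(-2n(alpha - r_hat)^2)
    under H0, so small values are evidence that the true risk is <= alpha.
    """
    n = len(losses)
    r_hat = sum(losses) / n
    return math.exp(-2.0 * n * max(alpha - r_hat, 0.0) ** 2)

def pareto_testing(ordered_configs, loss_fn, d_testing, alpha, delta):
    """Fixed-sequence testing over a pre-computed (Pareto) ordering.

    Walks the ordering produced on the optimization split and keeps
    rejecting H0 (risk > alpha) while p <= delta; stops at the first
    failure, which preserves the family-wise error rate at level delta.
    Returns the configurations validated as risk-controlling.
    """
    valid = []
    for cfg in ordered_configs:
        p = hoeffding_pvalue(loss_fn(cfg, d_testing), alpha)
        if p > delta:
            break  # first non-rejection ends the sequence
        valid.append(cfg)
    return valid
```

With, say, 100 calibration losses per configuration, a configuration with empirical risk 0.05 is rejected at alpha = 0.2 (p = exp(-4.5) ≈ 0.011), while one with empirical risk 0.5 yields p = 1 and stops the traversal.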