Language Models are Weak Learners
Authors: Hariharan Manikandan, Yiding Jiang, J. Zico Kolter
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct all of our experiments with OpenAI's GPT-3 API [Brown et al., 2020] and choose a collection of 18 tabular datasets from the UCI repository [Dua and Graff, 2017] and OpenML [Vanschoren et al., 2014]. All main experiments are done with the Curie variant of GPT-3 unless otherwise specified, which has 13B parameters. We compare the following methods. Zero-shot: query the language model with the data description and ask the model to complete the answer (see Appendix A.4). Few-shot: provide a few labeled data descriptions from the training data as context and ask the model to complete the answer for a new data description; to preserve consistency, we standardize the number of few-shot examples to approximately 15 for all datasets (the setting is explained in Appendix A.5). Summary (ours): generate a population of summaries from a list of data descriptions via cluster sampling, pick the summary with the lowest validation error, and use that summary as the context when asking the model to complete the answer for a new data description. Summary Boosting (ours): use Summary as the weak learner inside AdaBoost. (A hedged sketch of the boosting loop appears after the table.) |
| Researcher Affiliation | Collaboration | Hariharan Manikandan (1), Yiding Jiang (1), J. Zico Kolter (1,2); (1) Carnegie Mellon University, (2) Bosch Center for AI; {hmanikan, yidingji, zkolter}@cs.cmu.edu |
| Pseudocode | Yes | Algorithm 1 Cluster Sampling. 1: Input: X, all training data; y, all training labels; r, ratio of classes (r[k] is the proportion of examples in class k); p, AdaBoost weights of the current round; s, target number of samples. 2: S ← new empty set. 3: w ← new array with the same length as X, filled with −1 (w[i] is the probability of sampling example i). 4: for k = 1 to number of target classes in y do 5: E ← GPTEmbedding(X[y == k]) (E refers to the embeddings of the data descriptions). 6: C ← AgglomerativeClustering(E) (Cj is the set of data indices in the jth cluster). 7: c ← new empty array the same size as C (c[j] will store the sampling probability of cluster j). 8: for j = 1 to len(C) do 9: c[j] ← len(X) − len(Cj) 10: end for 11: for i = 1 to len(X) do 12: w[i] ← c[j], such that i ∈ Cj 13: end for 14: w ← Normalize(Normalize(w) ⊙ p) (Normalize turns weights into a probability distribution). 15: Sample s · r[k] examples from X using categorical distribution w and append to S. 16: end for 17: Return S. (A runnable Python sketch of this routine appears after the table.) |
| Open Source Code | No | The paper does not provide a link to its own source code or explicitly state that the code for its methodology is open-source. |
| Open Datasets | Yes | We conduct all of our experiments with OpenAI's GPT-3 API [Brown et al., 2020] and choose a collection of 18 tabular datasets from the UCI repository [Dua and Graff, 2017] and OpenML [Vanschoren et al., 2014]. |
| Dataset Splits | Yes | For each method and dataset, we use a 50/10/40 split for the train, validation, and test sets and repeat each experiment for 3 random seeds. (A minimal split sketch appears after the table.) |
| Hardware Specification | Yes | We conduct all of our experiments with OpenAI's GPT-3 API [Brown et al., 2020] and choose a collection of 18 tabular datasets from the UCI repository [Dua and Graff, 2017] and OpenML [Vanschoren et al., 2014]. All main experiments are done with the Curie variant of GPT-3 unless otherwise specified, which has 13B parameters. |
| Software Dependencies | No | The paper mentions using "OpenAI's GPT-3 API" and "Claude-2" but does not specify software versions for these or for other key software components used in the experiments (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | We perform ablation studies over the Summary method to select hyperparameters that yield a good weak learner. Preprocessing of continuous attributes: ...after hyperparameter tuning, we identified that using 5 bins provides sufficient granularity to distinguish variations in the continuous values (a plausible binning sketch appears after the table). Ordering of examples: ...we use shuffled ordering for all other experiments. Different summary and inference prompts: ...detailed prompts were used in all other experiments; ...we use the prefix prompt for all other experiments. |
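
The Summary Boosting cell above describes using Summary as a weak learner inside AdaBoost. Below is a minimal Python sketch of that outer loop, assuming standard multiclass AdaBoost (SAMME); `summary_learner` and `llm_predict` are hypothetical stand-ins for the paper's prompting subroutines, and the resample-on-failure step and numerical floor are assumptions, not the authors' exact implementation.

```python
import numpy as np

def summary_boosting(X, y, K, rounds, summary_learner, llm_predict):
    """X: data descriptions; y: integer labels in [0, K); K: number of classes.

    summary_learner(X, y, p) -> summary prompt   (hypothetical subroutine)
    llm_predict(summary, x)  -> predicted label  (hypothetical subroutine)
    """
    n = len(X)
    p = np.full(n, 1.0 / n)                      # AdaBoost example weights
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = summary_learner(X, y, p)             # weak learner for this round
        preds = np.array([llm_predict(h, x) for x in X])
        miss = (preds != np.asarray(y)).astype(float)
        err = (p * miss).sum() / p.sum()         # weighted training error
        if err >= 1.0 - 1.0 / K:                 # no better than random guessing
            continue                             # discard and try a new summary
        alpha = np.log((1.0 - err) / max(err, 1e-12)) + np.log(K - 1)
        p = p * np.exp(alpha * miss)             # upweight misclassified examples
        p = p / p.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas
```

At test time, the ensemble prediction would be the alpha-weighted plurality vote over the stored summaries.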
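
The cluster-sampling pseudocode (Algorithm 1) translates fairly directly into Python. The sketch below stubs out the embedding call, fixes the cluster count, and reads the cluster weight as len(X) − len(Cj); all three are assumptions layered on the quoted pseudocode rather than the paper's exact code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def gpt_embedding(texts):
    # Placeholder for a text-embedding call (e.g., an embeddings API).
    raise NotImplementedError

def cluster_sampling(X, y, r, p, s, n_clusters=8, seed=0):
    """X: data descriptions; y: labels; r: class -> class proportion;
    p: AdaBoost weights of the current round; s: target sample count."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    S = []
    w = np.full(len(X), -1.0)                    # w[i]: sampling weight of example i
    for k in np.unique(y):
        idx = np.flatnonzero(y == k)
        E = gpt_embedding([X[i] for i in idx])
        labels = AgglomerativeClustering(
            n_clusters=min(n_clusters, len(idx))).fit_predict(E)
        for j in np.unique(labels):
            members = idx[labels == j]
            # Larger clusters get smaller weight (assumed reading of
            # "c[j] <- len(X) - len(Cj)" in the pseudocode).
            w[members] = len(X) - len(members)
        w_k = w[idx] / w[idx].sum()              # normalize within the class
        w_k = w_k * p[idx]                       # fold in AdaBoost weights
        w_k = w_k / w_k.sum()                    # renormalize to a distribution
        n_k = min(int(round(s * r[k])), len(idx))  # class-proportional quota
        S.extend(rng.choice(idx, size=n_k, replace=False, p=w_k))
    return S
```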
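
The 50/10/40 split over 3 seeds can be reproduced with a two-stage split; the sketch assumes scikit-learn tooling, which the paper does not name.

```python
from sklearn.model_selection import train_test_split

for seed in range(3):                            # 3 random seeds
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.5, random_state=seed)
    # Validation is 10% of the total, i.e. 20% of the remaining half.
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=0.2, random_state=seed)
```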
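
For the 5-bin preprocessing of continuous attributes, a plausible equivalent uses scikit-learn's `KBinsDiscretizer`; the quantile binning strategy is an assumption, since the excerpt only fixes the bin count.

```python
from sklearn.preprocessing import KBinsDiscretizer

# 5 ordinal bins per continuous feature, before templating rows into text.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X_continuous)    # shape: (n_samples, n_features)
```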