Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning

Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek Wong, Zhuosheng Zhang, Rui Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios and can be extended to more applications with high-density features in output spaces, such as multiple-choice questions.
Researcher Affiliation | Collaboration | α Shanghai Jiao Tong University, β Tongyi Lab, γ NLP2CT Lab, University of Macau
Pseudocode | Yes | The pseudo-code of our TV Score computation pipeline is shown in Algorithm 1. (A hedged sketch of this computation appears after the table.)
Open Source Code | Yes | https://github.com/Alsace08/OOD-Math-Reasoning
Open Datasets | Yes | For the ID dataset, we use MultiArith [44], which consists of math word problems on arithmetic reasoning.
Dataset Splits | No | Given the limited size of MultiArith (only 600 samples) and the lack of a standard split, we allocate 360 samples for training and 240 for testing. However, with such a small test set, randomness in evaluation becomes a concern. To mitigate this, we conduct test sampling and set the number of sampling rounds to 1000. Specifically, we denote the ID dataset as D_in and the OOD dataset as D_out. For each sampling round, the collection is {D_in, D̃_out}, where D̃_out ⊆ D_out and |D_in| = |D̃_out|; this guarantees a balance of positive and negative samples. (A sketch of this balanced sampling appears after the table.)
Hardware Specification | Yes | Llama2-7B is trained with the AdamW optimizer [29] for 10K steps with a batch size of 8 on 4 RTX 3090 cards (2 per card). ... GPT2-XL is trained for 3K steps with a batch size of 128 on a single RTX 3090... We use Llama2-7B as the training backbone; each model is trained for 3K steps with a batch size of 8 on 4 NVIDIA Tesla V100 cards (2 per card).
Software Dependencies | No | The paper mentions software such as the Llama-2 tokenizer, Llama2-7B, GPT2-XL, SimCSE, UMAP, and the AdamW optimizer, but does not specify version numbers.
Experiment Setup | Yes | Llama2-7B is trained with the AdamW optimizer [29] for 10K steps with a batch size of 8 on 4 RTX 3090 cards (2 per card). The learning rate is set to 1e-5, the warmup steps to 10, and the maximum gradient norm to 0.3. GPT2-XL is trained for 3K steps with a batch size of 128 on a single RTX 3090; other configurations are the same as for Llama2-7B. (A minimal optimizer-configuration sketch follows the table.)
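
The TV Score referenced in the Pseudocode row is defined in Algorithm 1 of the paper as a trajectory-volatility measure over layer-wise embeddings. The sketch below is a minimal, hedged reconstruction under the assumption that the score (i) fits a Gaussian to the ID training embeddings at each layer, (ii) measures each test sample's Mahalanobis distance to that Gaussian per layer, and (iii) scores volatility as the average absolute change of that distance between adjacent layers. Function names (`fit_layer_gaussians`, `tv_score`) are illustrative, not the authors' code; consult Algorithm 1 for the exact formulation.

```python
import numpy as np

def fit_layer_gaussians(id_embeddings):
    """Fit a Gaussian (mean, precision matrix) to ID embeddings at each layer.

    id_embeddings: array of shape (n_id_samples, n_layers, hidden_dim).
    Returns a list of (mean, precision) pairs, one per layer.
    """
    _, n_layers, dim = id_embeddings.shape
    params = []
    for layer_idx in range(n_layers):
        layer = id_embeddings[:, layer_idx, :]
        mu = layer.mean(axis=0)
        cov = np.cov(layer, rowvar=False) + 1e-6 * np.eye(dim)  # regularize for invertibility
        params.append((mu, np.linalg.inv(cov)))
    return params

def tv_score(sample_trajectory, layer_params):
    """Trajectory-volatility-style score for a single sample (hedged sketch).

    sample_trajectory: array of shape (n_layers, hidden_dim), the sample's
    embedding at each layer. Computes the Mahalanobis distance to the ID
    Gaussian at each layer, then averages the absolute change between
    adjacent layers; larger values indicate a more volatile (OOD-like) trajectory.
    """
    dists = []
    for (mu, prec), hidden in zip(layer_params, sample_trajectory):
        diff = hidden - mu
        dists.append(float(np.sqrt(diff @ prec @ diff)))
    return float(np.abs(np.diff(dists)).mean())
```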
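
The Dataset Splits row describes repeated balanced test sampling: 1000 rounds, each pairing the fixed ID test set with an equally sized random subset of the OOD set. Below is a minimal sketch of that procedure under stated assumptions; the function name and the generator interface are placeholders, not the authors' code.

```python
import random

def balanced_test_sampling(d_in, d_out, n_rounds=1000, seed=0):
    """Yield balanced ID/OOD evaluation sets over repeated sampling rounds.

    d_in:  list of ID test samples (used in full every round).
    d_out: list of OOD samples; a random subset of size |d_in| is drawn
           each round so positives and negatives are balanced.
    Yields (round_index, id_samples, ood_subset) for downstream scoring.
    """
    rng = random.Random(seed)
    for i in range(n_rounds):
        ood_subset = rng.sample(d_out, k=len(d_in))
        yield i, d_in, ood_subset
```

Detection metrics (e.g., AUROC) would then be computed per round and averaged over the 1000 rounds to reduce the variance caused by the small 240-sample ID test set.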
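
The hyperparameters quoted in the Experiment Setup row (AdamW, learning rate 1e-5, 10 warmup steps, maximum gradient norm 0.3, 10K steps, per-GPU batch size 2) map onto a standard PyTorch/Transformers training loop. The fragment below shows one way those values could be wired together; the linear warmup/decay schedule and the `train` function signature are assumptions for illustration, not the authors' training script.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, train_dataloader, total_steps=10_000, lr=1e-5,
          warmup_steps=10, max_grad_norm=0.3):
    """Training loop wiring up the quoted hyperparameters (illustrative only)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # The paper states 10 warmup steps; a linear schedule is assumed here.
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    step = 0
    while step < total_steps:
        for batch in train_dataloader:  # per-GPU batch size 2 across 4 cards in the paper
            loss = model(**batch).loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            step += 1
            if step >= total_steps:
                break
```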