Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning
Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek Wong, Zhuosheng Zhang, Rui Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios and can be extended to more applications with high-density features in output spaces, such as multiple-choice questions. |
| Researcher Affiliation | Collaboration | α Shanghai Jiao Tong University; β Tongyi Lab; γ NLP2CT Lab, University of Macau |
| Pseudocode | Yes | The pseudo-code of our TV Score computation pipeline is shown in Algorithm 1 (a re-implementation sketch follows the table). |
| Open Source Code | Yes | https://github.com/Alsace08/OOD-Math-Reasoning |
| Open Datasets | Yes | For the ID dataset, we use MultiArith [44], which consists of math word problems on arithmetic reasoning. |
| Dataset Splits | No | Given the limited data size of MultiArith, totaling only 600 samples and lacking a standard division, we allocate 360 samples for training and 240 for testing. However, with such a small test set, randomness in evaluation becomes a concern. To mitigate this, we conduct test sampling and set the sampling size to 1000. Specifically, we denote the ID dataset as D_in and the OOD dataset as D_out. For each sampling, the collection is {D_in, D̃_out}, where D̃_out ⊆ D_out and |D_in| = |D̃_out|; this guarantees positive and negative sample balance (a sampling sketch follows the table). |
| Hardware Specification | Yes | Llama2-7B is trained with the AdamW optimizer [29] for 10K steps with a batch size of 8 on 4 RTX 3090 cards (2 per card). ... GPT2-XL is trained for 3K steps with a batch size of 128 on a single RTX 3090... We use Llama2-7B as the training backbone; each model is trained for 3K steps with a batch size of 8 on 4 NVIDIA Tesla V100 cards (2 per card). |
| Software Dependencies | No | The paper mentions software such as the Llama-2 tokenizer, Llama2-7B, GPT2-XL, SimCSE, UMAP, and the AdamW optimizer, but does not specify their version numbers. |
| Experiment Setup | Yes | Llama2-7B is trained with the AdamW optimizer [29] for 10K steps with a batch size of 8 on 4 RTX 3090 cards (2 per card). The learning rate is set to 1e-5, the warmup steps to 10, and the maximum gradient norm to 0.3. GPT2-XL is trained for 3K steps with a batch size of 128 on a single RTX 3090; other configurations are the same as for Llama2-7B (an optimizer/scheduler sketch follows the table). |
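For the Pseudocode row, the paper's Algorithm 1 is not reproduced here; the following is a minimal sketch of a trajectory-volatility style score, assuming per-layer Mahalanobis distances to ID statistics followed by a first-order difference across layers. The function names, regularization constant, and smoothing order are illustrative assumptions and may differ from the paper's actual TV Score pipeline.

```python
import numpy as np

def fit_id_statistics(id_embeddings):
    """Fit per-layer Gaussian statistics (mean, precision) on ID hidden states.

    id_embeddings: array of shape (n_samples, n_layers, dim).
    """
    stats = []
    for layer in range(id_embeddings.shape[1]):
        x = id_embeddings[:, layer, :]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])  # regularize
        stats.append((mu, np.linalg.inv(cov)))
    return stats

def tv_score(sample_embeddings, stats):
    """Volatility of a sample's embedding trajectory across layers.

    sample_embeddings: (n_layers, dim) hidden states of one generated answer.
    Computes a per-layer Mahalanobis distance to the ID statistics, then
    averages the absolute first-order differences of that distance trajectory.
    """
    dists = []
    for (mu, prec), h in zip(stats, sample_embeddings):
        d = h - mu
        dists.append(float(np.sqrt(d @ prec @ d)))
    return float(np.mean(np.abs(np.diff(dists))))  # higher -> more volatile -> more likely OOD
```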
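The balanced test sampling described in the Dataset Splits row can be reproduced with a short routine. The sketch below is illustrative only: the function name, seed handling, and data representation are assumptions rather than code from the released repository.

```python
import random

def balanced_test_samplings(d_in, d_out, n_samplings=1000, seed=0):
    """Repeatedly draw an OOD subset the same size as the ID test set.

    d_in:  ID test samples (240 MultiArith items in the split described above).
    d_out: OOD samples.
    Each round yields (d_in, d_out_subset) with |d_in| == |d_out_subset|,
    keeping positives and negatives balanced in every evaluation round.
    """
    rng = random.Random(seed)
    for _ in range(n_samplings):
        yield d_in, rng.sample(d_out, k=len(d_in))
```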
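The Experiment Setup hyperparameters map onto a standard PyTorch optimizer and warmup schedule. The helper below is a sketch under the reported values (learning rate 1e-5, 10 warmup steps, maximum gradient norm 0.3, batch size 8); the function name and the constant-after-warmup schedule shape are assumptions, not the authors' training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, lr=1e-5, warmup_steps=10):
    """AdamW with a linear warmup, matching the reported Llama2-7B settings."""
    optimizer = AdamW(model.parameters(), lr=lr)
    # Linear warmup for the first `warmup_steps` steps, then a constant LR.
    scheduler = LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, (step + 1) / max(1, warmup_steps)),
    )
    return optimizer, scheduler

# During each training step, gradients are clipped to the reported maximum norm:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
```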