Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Logit Mixing Training for More Reliable and Accurate Prediction
Authors: Duhyeon Bang, Kyungjune Baek, Jiwoo Kim, Yunho Jeon, Jin-Hwa Kim, Jiwon Kim, Jongwuk Lee, Hyunjung Shim
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental results on the image- and language-based tasks demonstrate that Logit Mix achieves state-of-the-art performance among recent data augmentation techniques regarding calibration error and prediction accuracy. |
| Researcher Affiliation | Collaboration | Duhyeon Bang¹, Kyungjune Baek², Jiwoo Kim³, Yunho Jeon⁴, Jin-Hwa Kim⁵, Jiwon Kim¹, Jongwuk Lee³ and Hyunjung Shim⁶; ¹SK T-Brain, ²School of Integrated Technology, Yonsei University, ³Department of Software, Sungkyunkwan University, ⁴MOFL, ⁵NAVER AI Lab, ⁶Kim Jaechul Graduate School of AI, KAIST |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described in text and mathematical formulas. |
| Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link or an explicit code release statement for the methodology described. |
| Open Datasets | Yes | The image datasets include CIFAR100 [Krizhevsky and Hinton, 2009] (32×32 RGB images in 100 classes), Tiny ImageNet (64×64 RGB images in 100 classes) and ILSVRC2015 [Russakovsky et al., 2015] (256×256 RGB images in 1000 classes). Additionally, the General Language Understanding Evaluation (GLUE) benchmark [Wang et al., 2018]. |
| Dataset Splits | Yes | The image datasets include CIFAR100 [Krizhevsky and Hinton, 2009] (32×32 RGB images in 100 classes), Tiny ImageNet (64×64 RGB images in 100 classes) and ILSVRC2015 [Russakovsky et al., 2015] (256×256 RGB images in 1000 classes). Additionally, the General Language Understanding Evaluation (GLUE) benchmark [Wang et al., 2018]. |
| Hardware Specification | Yes | To train the models on all the datasets except for ILSVRC2015, we use a single Titan XP GPU with 12 GB memory. For ILSVRC2015, we utilize four V100 GPU. |
| Software Dependencies | No | The paper mentions using "SGD optimization" and "BERT" but does not provide specific version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | When finetuning BERT_BASE (or BERT_LARGE), the batch size is 8, the learning rate is 2e-5, the max sequence length is 128, and the number of the training epochs is 3 for all eight tasks. We use a beta distribution with α = 3.0 for λ. |
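
As a rough illustration of the setup quoted in the Experiment Setup row, the sketch below collects the stated fine-tuning hyperparameters in one place and draws the mixing coefficient λ from a beta distribution. It is not the authors' code, since the paper releases none. The `FinetuneConfig` class, its field names, and the `sample_lambda` helper are hypothetical, and the symmetric form Beta(α, α) is an assumption borrowed from mixup-style augmentation, because the row only states "α = 3.0".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    batch_size: int = 8
    learning_rate: float = 2e-5
    max_seq_length: int = 128
    num_epochs: int = 3
    beta_alpha: float = 3.0  # the row only states "α = 3.0"


def sample_lambda(alpha: float, rng: random.Random) -> float:
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha).

    Assumption: both shape parameters equal alpha, as in mixup-style
    augmentation; the row does not state the second parameter.
    """
    return rng.betavariate(alpha, alpha)


config = FinetuneConfig()
rng = random.Random(0)  # fixed seed so repeated runs draw the same values
lambdas = [sample_lambda(config.beta_alpha, rng) for _ in range(10_000)]
mean_lambda = sum(lambdas) / len(lambdas)
```

Under this assumption the draws are symmetric around 0.5, so with α = 3.0 most mixed samples would blend their two inputs in fairly balanced proportions rather than staying close to either one.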