Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula

Authors: Sieun Hyeon, Kyudan Jung, Jaehee Won, Nam-Joon Kim, Hyun Gon Ryu, Hyuk-Jae Lee, Jaeyoung Do

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates LaTeX generation capabilities comparable to leading commercial Large Language Models (LLMs) while using fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for LaTeX translation, MathSpeech significantly outperformed GPT-4o: CER decreased from 0.390 to 0.298, and ROUGE/BLEU scores were higher than GPT-4o's.
Researcher Affiliation | Collaboration | 1) Department of Electrical and Computer Engineering, Seoul National University; 2) Department of Mathematics, Chung-Ang University; 3) College of Liberal Studies, Seoul National University; 4) NVIDIA; 5) Interdisciplinary Program in Artificial Intelligence, Seoul National University
Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 1, Figure 2, Figure 3, Figure 4) illustrating the pipeline and data collection process, but it does not include any explicit pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code: https://github.com/hyeonsieun/MathSpeech
Open Datasets | Yes | To address this gap, we first developed a novel benchmark dataset comprising 1,101 audio samples from real mathematics lectures available on YouTube. This dataset serves as a crucial tool for assessing the capabilities of various ASR models in mathematical speech recognition. [...] https://huggingface.co/datasets/AAAI2025/MathSpeech [...] We were able to obtain a publicly available dataset (Jung et al. 2024a) on Hugging Face and used it in our work.
Dataset Splits | No | The paper mentions collecting 6M ASR error results using Whisper-base and Whisper-small, and 1M ASR error results using Whisper-large-v2 and Canary-1B for training, and notes that the model with the lowest validation loss was selected during training. However, it does not specify the train/validation/test splits for the main evaluation dataset (the 1,101 audio samples from real mathematics lectures).
Hardware Specification | Yes | T5-small was trained with a batch size of 48 on an NVIDIA A100, and T5-base with a batch size of 84 on an NVIDIA H100. As a comparison group for our pipeline, we selected GPT-3.5 (OpenAI 2024b), GPT-4o (OpenAI 2024a), and Gemini Pro (Google DeepMind 2024), using 1-shot prompting with one example for all. [...] When inference latency was measured on an NVIDIA V100 GPU, it took 0.45 seconds to convert the ASR result of 5 seconds of speech into LaTeX.
Software Dependencies | No | The paper refers to various models and APIs such as T5-small, T5-base, GPT-3.5, GPT-4o, Gemini Pro, Whisper, Canary, and VITS. However, it does not provide version numbers for the underlying software libraries, frameworks, or programming languages (e.g., Python, PyTorch, CUDA) that would be needed for replication.
Experiment Setup | Yes | The maximum number of training epochs was set to 20, and the model with the lowest validation loss was selected. The learning rate was set to a maximum of 1e-4 and a minimum of 1e-6, adjusted using a linear learning-rate scheduler. For the Error Corrector, which requires two ASR outputs as input, the maximum input sequence length was set to 540, with an output length of 275. For the LaTeX translator, both input and output sequence lengths were set to 275. T5-small was trained with a batch size of 48 on an NVIDIA A100, and T5-base with a batch size of 84 on an NVIDIA H100. The authors' experiments showed that setting the weight λ1 for SE to 0.3 and the weight for LaTeX to 0.7 yielded the best performance, so those values were used.
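For context on the CER figures quoted above (0.390 for GPT-4o vs. 0.298 for MathSpeech): character error rate is the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch of the standard definition (not the paper's evaluation code, which may differ in normalization):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

On identical strings the rate is 0.0; each substitution, insertion, or deletion adds roughly 1/len(reference), so a CER of 0.298 means about three character edits per ten reference characters.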
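The learning-rate schedule and loss weighting in the Experiment Setup row can be sketched as below. This is an illustrative reconstruction, not the authors' code: the paper does not state whether the linear scheduler steps per epoch or per batch (per-epoch is assumed here), and the 0.7 weight for the LaTeX loss is taken from the stated pairing with λ1 = 0.3.

```python
MAX_LR, MIN_LR = 1e-4, 1e-6          # stated learning-rate bounds
MAX_EPOCHS = 20                      # stated training-epoch cap
LAMBDA_SE, LAMBDA_LATEX = 0.3, 0.7   # stated task-loss weights

def linear_lr(epoch: int) -> float:
    """Linearly anneal the learning rate from MAX_LR (epoch 0)
    down to MIN_LR (final epoch)."""
    frac = epoch / (MAX_EPOCHS - 1)
    return MAX_LR + frac * (MIN_LR - MAX_LR)

def combined_loss(loss_se: float, loss_latex: float) -> float:
    """Weighted sum of the spoken-English (SE) and LaTeX losses."""
    return LAMBDA_SE * loss_se + LAMBDA_LATEX * loss_latex
```

With these constants, `linear_lr(0)` returns 1e-4 and `linear_lr(19)` returns 1e-6, matching the stated bounds over the 20-epoch budget.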