Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models

Authors: Alexey Karev, Dong Xu

JAIR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate the efficacy of ConSCompF, two experiments aimed at identifying similarities between multiple LLMs are conducted. Additionally, these experiments examine the correlation between the similarity scores generated by ConSCompF and the differences in the outputs produced by other benchmarking techniques, such as ROUGE-L. Finally, a series of few-shot comparison experiments is conducted to evaluate how ConSCompF performs when only a small number of instruction samples is available.
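Since the report cites a correlation with ROUGE-L, a minimal sketch of that metric may be useful. ROUGE-L scores a candidate against a reference via their longest common subsequence (LCS) of tokens; the function below is an illustrative pure-Python implementation, not the authors' code, and the `beta` weighting is an assumption (standard ROUGE-L weights recall more heavily).

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate, reference, beta=1.2):
    # F-measure combining LCS-based precision and recall.
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)


print(round(rouge_l("the cat sat on the mat", "the cat is on the mat"), 3))  # → 0.833
```

Here the LCS is five tokens ("the cat on the mat"), so precision and recall are both 5/6 and the F-score reduces to 5/6 ≈ 0.833 regardless of `beta`.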
Researcher Affiliation Academia Alexey Karev (EMAIL), Dong Xu (EMAIL, corresponding author), School of Computer Engineering and Science, Shanghai University, Shanghai, China
Pseudocode No The paper describes the methodology in Section 3, outlining six steps with explanatory text and equations (e.g., Equation 1, 2, 3, 4). It also includes workflow diagrams (Figure 1, Figure 2). However, it does not present any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code No The paper mentions using the `bitsandbytes Python library` and obtaining models from `The Bloke (Jobbins, 2024)` but does not provide a link or explicit statement about releasing the source code for the Consistency-focused Similarity Comparison Framework (ConSCompF) itself.
Open Datasets Yes These instructions are extracted from the Alpaca dataset (Taori et al., 2023), which comprises 52,000 pairs of instructions and golden answers specifically designed for fine-tuning LLM-based AI assistants.
Dataset Splits Yes For the quantization experiment, we sampled 10% of the original dataset, resulting in 5,200 samples. Due to the increased number of parameters in the LLMs used in the second experiment, we had to further reduce the number of samples to 520, which is equivalent to 1% of the original dataset. Additionally, to test the performance of the proposed framework in few-shot scenarios, we randomly sample two few-shot datasets with 50 and 20 samples, and three few-shot datasets with 10 samples, using specially curated lists of instructions with average instruction consistency scores equal to 0.56, 0.73, and 0.95.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It only mentions the use of 'GGUF quantized versions of models' and obtaining 'non-GGUF models' using a Python library, which refers to model formats and software rather than the underlying hardware.
Software Dependencies No The paper mentions 'Transformers library in Python' and the 'bitsandbytes Python library' but does not specify their version numbers. It also refers to 'GGUF quantized versions of models' without a specific version for the GGUF format itself or the GGUF library used.
Experiment Setup Yes In both experiments, all models had the same text generation settings: a temperature of 0.7, a top-k of 50, a top-p of 0.95, and a maximum answer length of 128 tokens.
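The reported generation settings can be collected into a single configuration. The sketch below expresses them as keyword arguments in the Hugging Face `transformers` naming convention; this mapping is an assumption (the paper states the values, not the library call used), and the commented `model.generate(...)` line is purely illustrative.

```python
# Generation settings reported in the paper, in transformers-style names
# (assumption: the authors' toolkit accepts equivalent parameters).
generation_config = {
    "do_sample": True,       # temperature/top-k/top-p only take effect when sampling
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.95,
    "max_new_tokens": 128,   # maximum answer length of 128 tokens
}

# A call site would then look like (hypothetical):
# output_ids = model.generate(**inputs, **generation_config)

print(generation_config)
```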