Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

Authors: Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making it the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass the prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances the learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves 48% bit saving compared to gzip compressor.
Researcher Affiliation | Collaboration | Junxuan Zhang2*, Zhengxue Cheng1*, Yan Zhao1, Shihao Wang2, Dajiang Zhou2, Guo Lu1, Li Song1. 1Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; 2Ant Group, Hangzhou, China (author email addresses redacted).
Pseudocode | No | The paper describes the proposed method in Section 3, including subsections for Overall Architecture, Low-Complexity RWKV Models, Outlier-aware Tokenizer, and High-rank Reparameterization. However, it presents these descriptions using narrative text and a block diagram (Figure 2), rather than structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/alipay/L3TC-leveraging-rwkv-forlearned-lossless-low-complexity-text-compression.git
Open Datasets | Yes | Following the settings in (Delétang et al. 2024), we train models from scratch on enwik8 (Hutter Prize 2006) and test on enwik9, using a character-based tokenizer and vocabulary size of 128. The enwik8/enwik9 datasets, consisting of the first 100 MB and 1 GB of the English Wikipedia XML, are commonly used to evaluate text compression performance.
Dataset Splits | Yes | The enwik8/enwik9 datasets, consisting of the first 100 MB and 1 GB of the English Wikipedia XML, are commonly used to evaluate text compression performance. Since enwik8 only contains 10% of the data of enwik9, they represent a significant distribution shift. Therefore, we train our L3TC models on enwik8 and evaluate them on both enwik8 and enwik9 to assess the in-distribution and out-of-distribution compression performance.
Hardware Specification | Yes | The decoding speeds are measured on typical computing platforms, including server GPUs (NVIDIA A100 80 GB) and device NPUs (iPhone 12 Apple Neural Engine).
Software Dependencies | No | The paper mentions using the AdamW optimizer and converting models to Core ML packages (Apple 2023) for on-device performance measurement, but it does not provide specific version numbers for any software libraries, programming languages, or frameworks used for the implementation beyond these general references.
Experiment Setup | Yes | For RWKV models, we adjust the number of layers, attention embedding dimension, and hidden sizes to achieve target model sizes. We train the models using the AdamW (Loshchilov and Hutter 2019) optimizer with an initial learning rate of 1e-4 and a linear learning rate scheduler with a decay rate of 0.999 over 20 epochs, without warm-up. All models are trained with a sequence length of 2048 bytes and a batch size of 64.
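
The outlier-aware tokenizer described in the paper's abstract can be illustrated with a minimal sketch: a limited vocabulary covers frequent characters, while rare "outlier" characters bypass the learned prediction path and are stored literally. The class name, the reserved `OUTLIER` id, and the side channel for literals are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter

OUTLIER = 0  # hypothetical reserved id signalling a literal (bypassed) character

class OutlierAwareTokenizer:
    """Illustrative sketch: frequent chars get vocab ids, outliers bypass coding."""

    def __init__(self, corpus: str, vocab_size: int = 128):
        # Keep the most frequent characters; id 0 is reserved for outliers.
        frequent = [c for c, _ in Counter(corpus).most_common(vocab_size - 1)]
        self.char_to_id = {c: i + 1 for i, c in enumerate(frequent)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}

    def encode(self, text: str):
        ids, literals = [], []
        for c in text:
            if c in self.char_to_id:
                ids.append(self.char_to_id[c])   # goes through the learned model
            else:
                ids.append(OUTLIER)              # prediction/encoding is bypassed
                literals.append(c)               # raw character stored on the side
        return ids, literals

    def decode(self, ids, literals):
        out, lit = [], iter(literals)
        for i in ids:
            out.append(next(lit) if i == OUTLIER else self.id_to_char[i])
        return "".join(out)
```

In this toy form, round-tripping any text is lossless regardless of vocabulary coverage, which is the property a lossless compressor needs from its tokenizer.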
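
The reported training schedule (initial learning rate 1e-4, decay rate 0.999 over 20 epochs, no warm-up) can be sketched as follows. The paper does not specify whether the 0.999 decay is applied per epoch or per step, so this sketch assumes per-epoch multiplicative decay; the function name is illustrative.

```python
# Assumed interpretation of the reported schedule: per-epoch multiplicative decay.
INITIAL_LR = 1e-4
DECAY = 0.999
EPOCHS = 20

def lr_at_epoch(epoch: int) -> float:
    """Learning rate after `epoch` decay steps, starting at 1e-4 with no warm-up."""
    return INITIAL_LR * (DECAY ** epoch)
```

With these values the rate falls only about 2% over the full run, so the schedule is close to constant.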