FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Authors: Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, Jun Zhao

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate the effectiveness and robustness of FinBERT.
Researcher Affiliation | Collaboration | Zhuang Liu[1], Degen Huang[1], Kaiyu Huang[1], Zhuang Li[2] and Jun Zhao[2]; [1] Dalian University of Technology, Dalian, China; [2] Union Mobile Financial Technology Co., Ltd., Beijing, China
Pseudocode | No | The paper describes the model architecture and training tasks verbally but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and pre-trained models of FinBERT are available online.
Open Datasets | Yes | English Wikipedia and BooksCorpus (Zhu et al., 2015), the original training data used to train BERT (totaling 13GB, 3.31B words); Financial Web (totaling 24GB, 6.38B words), collected from the Common Crawl News dataset between July 2013 and December 2019, containing 13 million financial news articles (15GB after filtering), together with web-crawled financial articles from FINWEB (9GB after filtering); Yahoo Finance (totaling 19GB, 4.71B words), a dataset of financial articles published in the last four years that were crawled from Yahoo Finance and cleaned (removing markup, removing non-textual content and filtering out redundant data); Reddit Finance QA (totaling 5GB, 1.62B words), a corpus of automatically collected question-answer pairs about financial issues from the Reddit website with at least four upvotes. The statistics for all pre-training data are reported in Table 1, and the authors maintain an open repository of financial corpora that anyone can access and analyze. (A minimal sketch of the described cleaning pass appears after this table.)
Dataset Splits | Yes | FinSBD-2019 provides training data with boundary labels (beginning boundary vs. ending boundary) for each token. There are 953 distinct beginning tokens and 207 distinct ending tokens in the training and dev sets of the FinSBD-2019 data. (An illustrative labeling example appears after this table.)
Hardware Specification | No | The paper mentions training on 'hundreds of GPUs' and 'GPU cards' but does not specify any particular GPU model (e.g., NVIDIA V100, RTX 3090) or other hardware components (CPU type, memory).
Software Dependencies | No | The paper mentions using the 'Horovod framework' and 'TensorFlow' but does not provide specific version numbers for these software components.
Experiment Setup | No | The paper states, 'Our models, FinBERT-Large and FinBERT-Base, have the same model settings of transformer and pre-training hyperparameters as BERT,' and mentions using mixed-precision training with FP16 and loss scaling. However, it does not explicitly list concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) for either pre-training or fine-tuning. (A hedged mixed-precision training sketch appears after this table.)
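
The Open Datasets row describes a cleaning pass over web-crawled financial text (removing markup, removing non-textual content, filtering out redundant data). Below is a minimal Python sketch of that kind of pass; the function names and the hash-based exact-duplicate filter are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of a cleaning pass over web-crawled financial text.
# The helper names and the hash-based deduplication are assumptions;
# the paper only states that markup and non-textual content were
# removed and redundant data filtered out.
import hashlib
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")                                      # HTML/XML markup
NON_TEXT_RE = re.compile(r"[^\x09\x0A\x0D\x20-\x7E\u00A0-\uFFFF]")   # control chars, binary junk

def clean_document(raw_html: str) -> str:
    """Remove markup and non-textual content from one crawled page."""
    text = unescape(TAG_RE.sub(" ", raw_html))
    text = NON_TEXT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(documents):
    """Filter out exact-duplicate documents via content hashing."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if doc and digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

if __name__ == "__main__":
    pages = ["<p>Shares of ACME rose 3%.</p>", "<p>Shares of ACME rose 3%.</p>"]
    print(deduplicate([clean_document(p) for p in pages]))  # one cleaned document remains
```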
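The Dataset Splits row refers to token-level boundary labels in FinSBD-2019 (beginning boundary vs. ending boundary). The snippet below only illustrates what such per-token labels could look like; the tokens and label names are hypothetical and are not taken from the FinSBD-2019 release.

```python
# Hypothetical illustration of per-token sentence-boundary labels:
# "BS" marks a beginning-boundary token, "ES" an ending-boundary token,
# "O" everything else. Tokens and label names are made up for illustration.
tokens = ["Net", "income", "rose", "12", "%", ".", "(", "b", ")"]
labels = ["BS",  "O",      "O",    "O",  "O", "ES", "BS", "O", "ES"]
assert len(tokens) == len(labels)
```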
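The Software Dependencies and Experiment Setup rows note that the paper reports using Horovod, TensorFlow, and FP16 mixed-precision training with loss scaling, but gives no versions or hyperparameters. The sketch below shows one way such a setup can look, assuming TensorFlow 2.x-style Keras mixed-precision APIs (roughly TF 2.4 to 2.15) and Horovod's TensorFlow binding; the toy model, learning rate, and batch size are placeholders, not values from the paper.

```python
# A minimal sketch of FP16 mixed-precision training with dynamic loss
# scaling under Horovod data parallelism. NOT the authors' training code;
# the model, learning rate, and batch size are placeholders.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()                                                     # one process per GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

tf.keras.mixed_precision.set_global_policy("mixed_float16")    # FP16 compute, FP32 variables

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])        # stand-in for the BERT encoder
base_opt = tf.keras.optimizers.Adam(1e-4 * hvd.size())         # LR scaled by worker count (assumption)
opt = tf.keras.mixed_precision.LossScaleOptimizer(base_opt)    # dynamic loss scaling
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(features, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = tf.cast(model(features, training=True), tf.float32)
        loss = loss_fn(labels, logits)
        scaled_loss = opt.get_scaled_loss(loss)                # scale up to avoid FP16 underflow
    tape = hvd.DistributedGradientTape(tape)                   # allreduce gradients across workers
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)           # undo the loss scale before applying
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:                                            # sync initial weights across workers
        hvd.broadcast_variables(model.variables, root_rank=0)  # (optimizer slots too, in full code)
    return loss

# Example invocation with random placeholder data (batch size 8 is arbitrary).
x = tf.random.normal([8, 16])
y = tf.zeros([8], dtype=tf.int32)
print(float(train_step(x, y, first_batch=True)))
```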