FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Authors: Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, Jun Zhao

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate the effectiveness and robustness of FinBERT.
Researcher Affiliation | Collaboration | Zhuang Liu[1], Degen Huang[1], Kaiyu Huang[1], Zhuang Li[2] and Jun Zhao[2]; [1] Dalian University of Technology, Dalian, China; [2] Union Mobile Financial Technology Co., Ltd., Beijing, China
Pseudocode | No | The paper describes the model architecture and training tasks verbally but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and pre-trained models of FinBERT are available online.
Open Datasets | Yes | English Wikipedia and BooksCorpus (Zhu et al., 2015), the original training data used to train BERT (totaling 13GB, 3.31B words); Financial Web (totaling 24GB, 6.38B words), collected from the Common Crawl News dataset between July 2013 and December 2019, containing 13 million financial news articles (15GB after filtering), together with web-crawled financial articles from FINWEB (9GB after filtering); Yahoo Finance (totaling 19GB, 4.71B words), a dataset of financial articles published in the last four years that were crawled from Yahoo Finance and cleaned (removing markup, removing non-textual content and filtering out redundant data); Reddit Finance QA (totaling 5GB, 1.62B words), a corpus of automatically collected question-answer pairs about financial issues from the Reddit website with at least four upvotes. The statistics for all pre-training data are reported in Table 1, and the authors maintain an open repository of financial corpora that anyone can access and analyze. (A minimal sketch of the described cleaning pass appears after this table.)
Dataset Splits | Yes | FinSBD-2019 provides training data with boundary labels (beginning boundary vs. ending boundary) for each token. There are 953 distinct beginning tokens and 207 distinct ending tokens in the training and dev sets of the FinSBD-2019 data. (An illustrative labeling example appears after this table.)
Hardware Specification | No | The paper mentions training on 'hundreds of GPUs' and 'GPU cards' but does not specify any particular GPU model (e.g., NVIDIA V100, RTX 3090) or other hardware components (CPU type, memory).
Software Dependencies | No | The paper mentions using the 'Horovod framework' and 'TensorFlow' but does not provide specific version numbers for these software components.
Experiment Setup | No | The paper states, 'Our models, FinBERT-Large and FinBERT-Base, have the same model settings of transformer and pre-training hyperparameters as BERT,' and mentions using mixed-precision training with FP16 and loss scaling. However, it does not explicitly list concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) for either pre-training or fine-tuning. (A hedged mixed-precision training sketch appears after this table.)
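
The Open Datasets row describes a cleaning pass over web-crawled financial text (removing markup, removing non-textual content, filtering out redundant data). Below is a minimal Python sketch of that kind of pass; the function names and the hash-based exact-duplicate filter are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of a cleaning pass over web-crawled financial text.
# The helper names and the hash-based deduplication are assumptions;
# the paper only states that markup and non-textual content were
# removed and redundant data filtered out.
import hashlib
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")                                      # HTML/XML markup
NON_TEXT_RE = re.compile(r"[^\x09\x0A\x0D\x20-\x7E\u00A0-\uFFFF]")   # control chars, binary junk

def clean_document(raw_html: str) -> str:
    """Remove markup and non-textual content from one crawled page."""
    text = unescape(TAG_RE.sub(" ", raw_html))
    text = NON_TEXT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(documents):
    """Filter out exact-duplicate documents via content hashing."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if doc and digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

if __name__ == "__main__":
    pages = ["<p>Shares of ACME rose 3%.</p>", "<p>Shares of ACME rose 3%.</p>"]
    print(deduplicate([clean_document(p) for p in pages]))  # one cleaned document remains
```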
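The Dataset Splits row refers to token-level boundary labels in FinSBD-2019 (beginning boundary vs. ending boundary). The snippet below only illustrates what such per-token labels could look like; the tokens and label names are hypothetical and are not taken from the FinSBD-2019 release.

```python
# Hypothetical illustration of per-token sentence-boundary labels:
# "BS" marks a beginning-boundary token, "ES" an ending-boundary token,
# "O" everything else. Tokens and label names are made up for illustration.
tokens = ["Net", "income", "rose", "12", "%", ".", "(", "b", ")"]
labels = ["BS",  "O",      "O",    "O",  "O", "ES", "BS", "O", "ES"]
assert len(tokens) == len(labels)
```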
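The Software Dependencies and Experiment Setup rows note that the paper reports using Horovod, TensorFlow, and FP16 mixed-precision training with loss scaling, but gives no versions or hyperparameters. The sketch below shows one way such a setup can look, assuming TensorFlow 2.x-style Keras mixed-precision APIs (roughly TF 2.4 to 2.15) and Horovod's TensorFlow binding; the toy model, learning rate, and batch size are placeholders, not values from the paper.

```python
# A minimal sketch of FP16 mixed-precision training with dynamic loss
# scaling under Horovod data parallelism. NOT the authors' training code;
# the model, learning rate, and batch size are placeholders.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()                                                     # one process per GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

tf.keras.mixed_precision.set_global_policy("mixed_float16")    # FP16 compute, FP32 variables

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])        # stand-in for the BERT encoder
base_opt = tf.keras.optimizers.Adam(1e-4 * hvd.size())         # LR scaled by worker count (assumption)
opt = tf.keras.mixed_precision.LossScaleOptimizer(base_opt)    # dynamic loss scaling
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(features, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = tf.cast(model(features, training=True), tf.float32)
        loss = loss_fn(labels, logits)
        scaled_loss = opt.get_scaled_loss(loss)                # scale up to avoid FP16 underflow
    tape = hvd.DistributedGradientTape(tape)                   # allreduce gradients across workers
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)           # undo the loss scale before applying
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:                                            # sync initial weights across workers
        hvd.broadcast_variables(model.variables, root_rank=0)  # (optimizer slots too, in full code)
    return loss

# Example invocation with random placeholder data (batch size 8 is arbitrary).
x = tf.random.normal([8, 16])
y = tf.zeros([8], dtype=tf.int32)
print(float(train_step(x, y, first_batch=True)))
```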