Abstract
This page publishes a Bidirectional Encoder Representations from Transformers (BERT) model that was pre-trained on a large Japanese clinical text corpus (approximately 120 million lines). The model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To develop the model, we leveraged the TensorFlow implementation of BERT published by Google on this page. This study was approved by the Institutional Review Board at the University of Tokyo Hospital (2019276NI).
UTH-BERT
UTH-BERT-BASE-128 (12-layer, 768-hidden, 12-heads)
- max_seq_length: 128
- max_position_embeddings: 512
- Whole Word Masking: Not applied
- Vocab size: 25,000 (Obtained with Byte Pair Encoding)
- Morphological analyzer: MeCab
- External dictionaries: J-MeDic (MANBYO_201907), mecab-ipadic-neologd
- Pre-training steps: mini-batch size 50 × 10 million steps
- Accuracy: MLM 0.773, NSP 0.975
- UTH_BERT_BASE_MC_BPE_V25000_10M.zip (deprecated)
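The hyperparameters listed above correspond to a standard BERT-Base architecture. The snippet below is a minimal, unofficial sketch of a matching Hugging Face transformers BertConfig: the values are taken from the list above, everything else (e.g. intermediate_size) is the BERT-Base default, and no such configuration object is shipped with this release.

```python
# Minimal sketch (not an official file from this release): a transformers
# BertConfig matching the hyperparameters listed above. max_seq_length is a
# pre-training setting, not a config field, so it does not appear here.
from transformers import BertConfig

config = BertConfig(
    vocab_size=25000,             # 25,000 BPE subwords
    hidden_size=768,              # 768-hidden
    num_hidden_layers=12,         # 12-layer
    num_attention_heads=12,       # 12-heads
    intermediate_size=3072,       # BERT-Base default
    max_position_embeddings=512,  # as listed above
)
```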
UTH-BERT-BASE-512-WWM (12-layer, 768-hidden, 12-heads)
- max_seq_length: 512
- max_position_embeddings: 512
- Whole Word Masking: Applied
- Vocab size: 25,000 (Obtained with Byte Pair Encoding)
- Morphological analyzer: MeCab
- External dictionaries: J-MeDic (MANBYO_201907), mecab-ipadic-neologd
- Pre-training steps: mini-batch size 2048 × 352K steps
- Accuracy: MLM 0.793, NSP 0.981
- UTH_BERT_BASE_512_MC_BPE_WWM_V25000_352K.zip (full)
- UTH_BERT_BASE_512_MC_BPE_WWM_V25000_352K.zip (pytorch)
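The sketch below shows one possible way to load the extracted PyTorch archive with Hugging Face transformers. The directory name and the file names bert_config.json, pytorch_model.bin, and vocab.txt are assumptions about the archive layout, not documented facts about this release; adjust them to the actual contents of the unzipped archive.

```python
# Minimal sketch, assuming the unzipped PyTorch archive contains a config JSON,
# a weight file, and the 25,000-token vocabulary. All file names below are
# placeholders for whatever the archive actually contains.
import torch
from transformers import BertConfig, BertModel, BertTokenizer

model_dir = "UTH_BERT_BASE_512_MC_BPE_WWM_V25000_352K"  # path to unzipped archive

config = BertConfig.from_json_file(f"{model_dir}/bert_config.json")
model = BertModel(config)

# Checkpoints converted from a pre-training graph often prefix encoder weights
# with "bert." and include MLM/NSP heads; strip the prefix and ignore the heads.
state_dict = torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu")
state_dict = {(k[len("bert."):] if k.startswith("bert.") else k): v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)

tokenizer = BertTokenizer(vocab_file=f"{model_dir}/vocab.txt", do_lower_case=False)
```

If the archive already follows the standard transformers layout (config.json, pytorch_model.bin, and vocab.txt in one directory), BertModel.from_pretrained(model_dir) is the simpler route.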
Reference:
- Kawazoe Y, Shibata D, Shinohara E, Aramaki E, Ohe K. A clinical specific BERT developed using a huge Japanese clinical text corpus. PLoS One. 2021 Nov 9;16(11):e0259763.
Source code
- The code for pre-processing text and tokenization is available here (an unofficial sketch of the pipeline also follows below).
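For illustration only (the linked repository is the authoritative implementation), the sketch below mirrors the two-stage pipeline described on this page: word segmentation with MeCab, using mecab-ipadic-neologd and J-MeDic as external dictionaries, followed by subword tokenization over the released vocabulary. The mecab-python3 binding, the dictionary paths, and the vocab.txt path are assumptions, not part of this release.

```python
# Unofficial sketch of the tokenization pipeline: MeCab word segmentation with
# external dictionaries, then greedy subword splitting against the released
# 25,000-token vocabulary. All paths are placeholders.
import MeCab                      # mecab-python3 binding (assumed)
from transformers import BertTokenizer

mecab = MeCab.Tagger(
    "-Owakati "
    "-d /path/to/mecab-ipadic-neologd "  # system dictionary (placeholder path)
    "-u /path/to/MANBYO_201907.dic"      # J-MeDic user dictionary (placeholder path)
)
bert_tokenizer = BertTokenizer(
    vocab_file="vocab.txt",        # from the downloaded archive (placeholder)
    do_lower_case=False,
    tokenize_chinese_chars=False,  # MeCab already segments; do not split every kanji
)

def tokenize(text: str) -> list[str]:
    words = mecab.parse(text).strip().split(" ")   # morphological analysis
    subwords = []
    for word in words:
        subwords.extend(bert_tokenizer.tokenize(word))  # subword split over vocab.txt
    return subwords

print(tokenize("誤嚥性肺炎の疑いで入院した。"))
```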