Abstract
This page publishes a Bidirectional Encoder Representations from Transformers (BERT) model that was pre-trained on a large Japanese clinical text corpus (approximately 120 million lines). The model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To develop the model, we leveraged the TensorFlow implementation of BERT published by Google on this page. This study was approved by the Institutional Review Board at the University of Tokyo Hospital (2019276NI).
UTH-BERT
UTH-BERT-BASE-128 (12-layer, 768-hidden, 12-heads)
- max_seq_length: 128
- max_position_embeddings: 512
- Whole Word Masking: Not applied
- Vocab size: 25,000 (Obtained with Byte Pair Encoding)
- Morphological analyzer: MeCab
- External dictionaries: J-MeDic (MANBYO_201907), mecab-ipadic-neologd
- Pre-training steps: mini-batch size 50 × 10 million steps
- Accuracy: MLM 0.773, NSP 0.975
- UTH_BERT_BASE_MC_BPE_V25000_10M.zip (deprecated)
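The hyperparameters listed above correspond to a standard BERT-Base architecture. The snippet below is a minimal, unofficial sketch of a matching Hugging Face transformers BertConfig: the values are taken from the list above, everything else (e.g. intermediate_size) is the BERT-Base default, and no such configuration object is shipped with this release.

```python
# Minimal sketch (not an official file from this release): a transformers
# BertConfig matching the hyperparameters listed above. max_seq_length is a
# pre-training setting, not a config field, so it does not appear here.
from transformers import BertConfig

config = BertConfig(
    vocab_size=25000,             # 25,000 BPE subwords
    hidden_size=768,              # 768-hidden
    num_hidden_layers=12,         # 12-layer
    num_attention_heads=12,       # 12-heads
    intermediate_size=3072,       # BERT-Base default
    max_position_embeddings=512,  # as listed above
)
```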
UTH-BERT-BASE-512-WWM (12-layer, 768-hidden, 12-heads)
- max_seq_length: 512
- max_position_embeddings: 512
- Whole Word Masking: Applied
- Vocab size: 25,000 (Obtained with Byte Pair Encoding)
- Morphological analyzer: MeCab
- External dictionaries: J-MeDic (MANBYO_201907), mecab-ipadic-neologd
- Pre-training steps: mini-batch size 2048 × 352K steps
- Accuracy: MLM 0.793, NSP 0.981
- UTH_BERT_BASE_512_MC_BPE_WWM_V25000_352K.zip (full)
- UTH_BERT_BASE_512_MC_BPE_WWM_V25000_352K.zip (pytorch)
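The sketch below shows one possible way to load the extracted PyTorch archive with Hugging Face transformers. The directory name and the file names bert_config.json, pytorch_model.bin, and vocab.txt are assumptions about the archive layout, not documented facts about this release; adjust them to the actual contents of the unzipped archive.

```python
# Minimal sketch, assuming the unzipped PyTorch archive contains a config JSON,
# a weight file, and the 25,000-token vocabulary. All file names below are
# placeholders for whatever the archive actually contains.
import torch
from transformers import BertConfig, BertModel, BertTokenizer

model_dir = "UTH_BERT_BASE_512_MC_BPE_WWM_V25000_352K"  # path to unzipped archive

config = BertConfig.from_json_file(f"{model_dir}/bert_config.json")
model = BertModel(config)

# Checkpoints converted from a pre-training graph often prefix encoder weights
# with "bert." and include MLM/NSP heads; strip the prefix and ignore the heads.
state_dict = torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu")
state_dict = {(k[len("bert."):] if k.startswith("bert.") else k): v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)

tokenizer = BertTokenizer(vocab_file=f"{model_dir}/vocab.txt", do_lower_case=False)
```

If the archive already follows the standard transformers layout (config.json, pytorch_model.bin, and vocab.txt in one directory), BertModel.from_pretrained(model_dir) is the simpler route.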
Reference:
- Kawazoe Y, Shibata D, Shinohara E, Aramaki E, Ohe K. A clinical specific BERT developed using a huge Japanese clinical text corpus. PLoS One. 2021 Nov 9;16(11):e0259763.
Source code
- The code for pre-processing text and tokenization is available here (an unofficial sketch of the pipeline also follows below).
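For illustration only (the linked repository is the authoritative implementation), the sketch below mirrors the two-stage pipeline described on this page: word segmentation with MeCab, using mecab-ipadic-neologd and J-MeDic as external dictionaries, followed by subword tokenization over the released vocabulary. The mecab-python3 binding, the dictionary paths, and the vocab.txt path are assumptions, not part of this release.

```python
# Unofficial sketch of the tokenization pipeline: MeCab word segmentation with
# external dictionaries, then greedy subword splitting against the released
# 25,000-token vocabulary. All paths are placeholders.
import MeCab                      # mecab-python3 binding (assumed)
from transformers import BertTokenizer

mecab = MeCab.Tagger(
    "-Owakati "
    "-d /path/to/mecab-ipadic-neologd "  # system dictionary (placeholder path)
    "-u /path/to/MANBYO_201907.dic"      # J-MeDic user dictionary (placeholder path)
)
bert_tokenizer = BertTokenizer(
    vocab_file="vocab.txt",        # from the downloaded archive (placeholder)
    do_lower_case=False,
    tokenize_chinese_chars=False,  # MeCab already segments; do not split every kanji
)

def tokenize(text: str) -> list[str]:
    words = mecab.parse(text).strip().split(" ")   # morphological analysis
    subwords = []
    for word in words:
        subwords.extend(bert_tokenizer.tokenize(word))  # subword split over vocab.txt
    return subwords

print(tokenize("誤嚥性肺炎の疑いで入院した。"))
```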