UTH-BERT: a BERT pre-trained with Japanese clinical text

Abstract

This page publishes a Bidirectional Encoder Representations from Transformers (BERT) model pre-trained on a large corpus of Japanese clinical text (approximately 120 million lines). The model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To develop the model, we leveraged the TensorFlow implementation of BERT published by Google on this page. This study was approved by the Institutional Review Board at the University of Tokyo Hospital (2019276NI).

UTH-BERT 

    • UTH-BERT-BASE-128 (12-layer, 768-hidden, 12-heads)

      • max_seq_length: 128
      • max_position_embeddings: 512
      • Whole Word Masking: Not applied
      • Vocabulary size: 25,000 (obtained with byte-pair encoding)
      • Morphological analyzer: MeCab
      • External dictionaries: J-MeDic (MANBYO_201907), mecab-ipadic-neologd
      • Pre-training: 10 million steps with a mini-batch size of 50
      • Accuracy: masked language model (MLM) 0.773, next sentence prediction (NSP) 0.975
      • UTH_BERT_BASE_MC_BPE_V25000_10M.zip (deprecated; see the loading sketch after this list)
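
As a minimal sketch (not the official loader), the released checkpoint can be converted for use with the Hugging Face Transformers library. This assumes the unzipped archive contains the usual BERT artefacts (bert_config.json, a vocabulary file, and a TensorFlow checkpoint); the exact file names below are illustrative.

    # Hypothetical paths; adjust to the contents of the unzipped archive.
    from transformers import BertConfig, BertModel

    config = BertConfig.from_json_file("UTH_BERT_BASE_MC_BPE_V25000_10M/bert_config.json")
    model = BertModel.from_pretrained(
        "UTH_BERT_BASE_MC_BPE_V25000_10M/model.ckpt.index",  # TF checkpoint index file
        from_tf=True,   # convert from the TensorFlow checkpoint (requires TensorFlow installed)
        config=config,
    )
    model.save_pretrained("uth-bert-pytorch")  # re-save in PyTorch format for later use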

Source code

  • Code for text pre-processing and tokenization is available here; a rough usage sketch follows.
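
The following is a minimal sketch of the two-stage tokenization used by the model: MeCab word segmentation (with mecab-ipadic-neologd and J-MeDic as system and user dictionaries) followed by subword tokenization against the 25,000-entry vocabulary. The dictionary paths, the vocabulary file name, and the use of BertTokenizer are assumptions; see the official pre-processing code for the exact procedure.

    import MeCab
    from transformers import BertTokenizer

    # Illustrative paths; point these at your local dictionary and vocabulary files.
    mecab = MeCab.Tagger(
        "-Owakati "
        "-d /usr/lib/mecab/dic/mecab-ipadic-neologd "
        "-u /usr/local/lib/mecab/dic/MANBYO_201907_Dic-utf8.dic"
    )
    tokenizer = BertTokenizer("UTH_BERT_BASE_MC_BPE_V25000_10M/vocab.txt", do_lower_case=False)

    text = "髄膜腫の疑いがあり、造影MRIを施行した。"    # example clinical sentence
    words = mecab.parse(text).strip()              # whitespace-separated morphemes
    subwords = tokenizer.tokenize(words)           # subword split over the 25,000-token vocabulary
    input_ids = tokenizer.convert_tokens_to_ids(subwords)
    print(subwords)
    print(input_ids)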