- Traditional algorithms: Chinese word segmentation with N-grams, HMMs, maximum entropy, CRFs, etc.
- Neural network methods: CNN, Bi-LSTM, Transformer, etc.
- Pretrained language model methods: BERT, etc.
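To make the traditional unigram approach concrete, here is a minimal sketch of a dictionary-based segmenter: dynamic programming (Viterbi-style) over word unigram log-probabilities. The lexicon and its probabilities are invented for the example; a real system would estimate them from a training corpus such as PKU or MSR.

```python
import math

# Toy unigram lexicon with made-up probabilities -- for illustration only.
LEXICON = {
    "研究": 0.05, "研究生": 0.02, "生命": 0.03, "命": 0.01,
    "的": 0.1, "起源": 0.02, "生": 0.01, "研": 0.001, "究": 0.001,
}
OOV_LOGP = math.log(1e-8)  # fallback score for out-of-vocabulary single chars

def unigram_segment(text, max_word_len=4):
    """best[i] = best log-probability of any segmentation of text[:i]."""
    n = len(text)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)  # back[i]: start index of the last word ending at i
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in LEXICON:
                logp = math.log(LEXICON[word])
            else:
                # Unknown multi-char strings are disallowed; unknown single
                # chars get a small floor probability so DP never gets stuck.
                logp = OOV_LOGP if len(word) == 1 else float("-inf")
            if best[j] + logp > best[i]:
                best[i] = best[j] + logp
                back[i] = j
    # Recover the segmentation by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

print(unigram_segment("研究生命的起源"))  # → ['研究', '生命', '的', '起源']
```

Note how the unigram probabilities resolve the classic ambiguity: "研究生" (graduate student) scores worse overall than "研究 / 生命" (research / life), so the DP picks the latter.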
 
- PKU and MSR are the datasets used in the Chinese word segmentation bakeoff organized by SIGHAN in 2005, and they remain the standard benchmarks for evaluating segmentation tools in academia.
 
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Uni-Gram | 0.8550 | 0.9342 | 0.8928 | 
| Uni-Gram+规则 | 0.9111 | 0.9496 | 0.9300 | 
| HMM | 0.7936 | 0.8090 | 0.8012 | 
| CRF | 0.9409 | 0.9396 | 0.9400 | 
| Bi-LSTM | 0.9248 | 0.9236 | 0.9240 | 
| Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 | 
| BERT | 0.9712 | 0.9635 | 0.9673 | 
| BERT-CRF | 0.9705 | 0.9619 | 0.9662 | 
| jieba | 0.8559 | 0.7896 | 0.8214 | 
| pkuseg | 0.9512 | 0.9224 | 0.9366 | 
| THULAC | 0.9287 | 0.9295 | 0.9291 | 

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Uni-Gram | 0.9119 | 0.9633 | 0.9369 | 
| Uni-Gram+规则 | 0.9129 | 0.9634 | 0.9375 | 
| HMM | 0.7786 | 0.8189 | 0.7983 | 
| CRF | 0.9675 | 0.9676 | 0.9675 | 
| Bi-LSTM | 0.9624 | 0.9625 | 0.9624 | 
| Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 | 
| BERT | 0.9841 | 0.9817 | 0.9829 | 
| BERT-CRF | 0.9805 | 0.9787 | 0.9796 | 
| jieba | 0.8204 | 0.8145 | 0.8174 | 
| pkuseg | 0.8701 | 0.8894 | 0.8796 | 
| THULAC | 0.8428 | 0.8880 | 0.8648 |
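The precision, recall, and F1 columns above follow the standard word-level evaluation for segmentation: a predicted word counts as correct only if both of its character boundaries match a gold word exactly. A minimal sketch of this computation (the example sentences are invented; segmentations are represented as word lists over the same underlying string):

```python
def to_spans(words):
    """Convert a word list to the set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision / recall / F1 via exact span matching."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["研究", "生命", "的", "起源"]
pred = ["研究生", "命", "的", "起源"]
p, r, f1 = prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # → P=0.5000 R=0.5000 F1=0.5000
```

Only "的" and "起源" match on both boundaries here, giving 2 correct words out of 4 predicted and 4 gold, hence P = R = F1 = 0.5.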