Experiments in Chinese Historical Phonology using matrix decomposition and factorization methods.
We use Python to prepare our data. The following packages are required:
- pandas
- numpy
- cjklib
- vPhon: a Vietnamese phonetizer; clone it to your local directory /path/to/vPhon
- fancyimpute: install it from its GitHub repository
In addition to cjklib, the Unihan Database is used. The latest Unihan.zip can be downloaded from https://www.unicode.org/Public/UCD/. Unzip it to /path/to/Unihan.
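For orientation, the Unihan data files are plain text with one tab-separated record per line (code point, field name, value), and comment lines start with `#`. The sketch below shows one way such records could be collected per character. It is not the code used in this repository: the function name `parse_unihan_lines`, the choice of fields, and the sample lines are illustrative assumptions.

```python
from collections import defaultdict


def parse_unihan_lines(lines, wanted_fields):
    """Collect selected Unihan fields per character.

    `lines` is any iterable of text lines in the Unihan format
    "U+XXXX<TAB>kFieldName<TAB>value". Hypothetical helper, not part
    of this repository.
    """
    readings = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        codepoint, field, value = line.split("\t", 2)
        if field in wanted_fields:
            # "U+4E00" -> the character with code point 0x4E00
            char = chr(int(codepoint[2:], 16))
            readings[char][field] = value
    return dict(readings)


# Illustrative sample lines in the Unihan record format.
sample = [
    "# comment line",
    "U+4E00\tkCantonese\tjat1",
    "U+4E00\tkMandarin\tyī",
    "U+4E00\tkDefinition\tone; a, an; alone",
]
table = parse_unihan_lines(sample, {"kCantonese", "kMandarin"})
print(table)
```

To read a real file, the same function can be passed an open file handle, for example one opened on a file under /path/to/Unihan with `encoding="utf-8"`.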
Once you have cloned this repository to your local /path/to/ChnHistPhon, you can run
python /path/to/ChnHistPhon/ChnHistPhon_1_data_preparation.py
which will create ChnCharData.csv, the dataset of Chinese characters we need, in /path/to/ChnHistPhon/results.
We used softImpute (Mazumder et al., 2010) to complete the data matrix in ChnCharData.csv, followed by dictionary learning and sparse coding. Both steps are run in ChnHistPhon_2_run_SoftImpute_DictionaryLearning.py.
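To give a rough idea of what the matrix completion step does, here is a minimal NumPy sketch of the soft-impute idea: repeatedly fill the missing entries with the current estimate, take an SVD, and shrink the singular values. This is a simplified illustration, not the fancyimpute implementation and not the script in this repository. The function name `soft_impute`, the `shrinkage` value, and the toy matrix are assumptions made for the example.

```python
import numpy as np


def soft_impute(X, shrinkage, max_iters=200, tol=1e-5):
    """Fill NaN entries of X by iterative soft-thresholded SVD.

    Simplified sketch of the soft-impute idea; hypothetical helper,
    not the fancyimpute API.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)  # current low-rank estimate
    for _ in range(max_iters):
        # Keep observed entries, use the current estimate elsewhere.
        filled = np.where(observed, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values.
        s = np.maximum(s - shrinkage, 0.0)
        Z_new = (U * s) @ Vt
        # Relative change between successive estimates.
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; only the gaps are filled.
    return np.where(observed, X, Z)


# Toy example: a rank-1 matrix with two entries hidden.
truth = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
M = truth.copy()
M[0, 1] = np.nan
M[2, 3] = np.nan
completed = soft_impute(M, shrinkage=0.1)
print(np.round(completed, 2))
```

A larger `shrinkage` pushes the estimate toward lower rank, so in practice the value would need to be tuned for the data at hand.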
The results can be viewed here.