Chinese tokenisation and tagging

There are no orthographic boundaries between words in Chinese. This is the main difficulty of working with Chinese computationally (in addition to the bewildering array of encodings used for Chinese and the simplified/traditional script controversy). A Chinese word frequently consists of two, three or more characters, and the definition of what counts as a word in Chinese is the subject of intense debate (the same is true for other languages: constructions like 'as well as' or 'give up' have all the properties of a single word, and names like 'White House' carry their intended meaning only when taken as a whole).

You can download the following resources:

  1. Chinese tokeniser
  2. Parameter files for TnT
  3. Parameter files for TreeTagger
The tagset is described on the LCMC page. The accuracy of the taggers is about 93-94%.

The lack of a single accepted definition of the word also creates problems for part-of-speech tagging: if the words output by a segmenter do not match what the tagger considers to be a word, the accuracy of the latter drops substantially. Four cases in point are:

  1. particles and suffixes: are they listed as individual tokens or kept together with corresponding verbs/nouns as a single word? (学习了  or  学习  了, 专家们  or  专家  们)
  2. measure words: are they separated as tokens? (两张  纸  or  两  张  纸; this also applies to dates: 2007年7月2日)
  3. names of people: are surnames combined with names or not? (邓小平 or 邓  小平)
  4. compounds: do we treat organisation names as units or decompose them into their constituent words? (安全理事会  or  安全  理事会)
The parameter files listed above have been trained on the Lancaster Corpus of Mandarin Chinese (LCMC), so the decisions made in that corpus guide our segmentation rules.
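
To make these choices concrete, the examples from the list above can be written out as the two competing token sequences. This is purely illustrative; which alternative a given corpus adopts is a matter of annotation convention, not something decided by the segmentation algorithm itself.

  from typing import Dict, List, Tuple

  # Each entry maps a string to its "joined" and "split" segmentations,
  # mirroring the four cases discussed above (illustrative only).
  alternatives: Dict[str, Tuple[List[str], List[str]]] = {
      "学习了": (["学习了"], ["学习", "了"]),            # verb + aspect particle
      "专家们": (["专家们"], ["专家", "们"]),            # noun + plural suffix
      "两张纸": (["两张", "纸"], ["两", "张", "纸"]),     # numeral + measure word
      "邓小平": (["邓小平"], ["邓", "小平"]),            # surname + given name
      "安全理事会": (["安全理事会"], ["安全", "理事会"]),  # compound name
  }

  for text, (joined, split) in alternatives.items():
      print(text, "->", joined, "or", split)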

The segmenter implements a simple longest-match dictionary lookup algorithm with a couple of built-in heuristics for cases like 据报道 (where the first character is more likely to be a separate token) and 就是 (where the two characters may form a single word in some contexts). The algorithm is simple, but it achieves an accuracy of 94-95% on the test files from the SIGHAN 2005 Bakeoff. Much more can be achieved by cleverer statistical techniques, such as those described by participants in the SIGHAN competition (see the overview). Because the algorithm relies on a dictionary obtained from a segmented corpus, its performance on out-of-vocabulary words is poor. Please get in touch if you want to contribute to the open-source development of the segmenting tool listed on this page.
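
As a rough illustration, here is a minimal Python sketch of greedy longest-match segmentation. The dictionary format (a plain UTF-8 word list, one word per line) and the function names are assumptions for the sketch, and the heuristics for 据报道 and 就是 mentioned above are not reproduced.

  def load_dictionary(path):
      """Read a UTF-8 word list, one word per line, into a set."""
      with open(path, encoding="utf-8") as f:
          return {line.strip() for line in f if line.strip()}

  def segment(text, dictionary, max_len=8):
      """Greedily take the longest dictionary match at each position;
      unknown characters fall back to single-character tokens."""
      tokens = []
      i = 0
      while i < len(text):
          match = text[i]  # fallback: a single character
          for length in range(min(max_len, len(text) - i), 1, -1):
              candidate = text[i:i + length]
              if candidate in dictionary:
                  match = candidate
                  break
          tokens.append(match)
          i += len(match)
      return tokens

  # Example with a toy dictionary:
  # segment("两张纸", {"两", "张", "纸", "两张"}) -> ["两张", "纸"]

The max_len cap simply bounds how far ahead the lookup scans; in practice it would be set to the length of the longest entry in the dictionary.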

I have also computed frequency lists for the Internet and LCMC corpora: Internet corpus and LCMC corpus.
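
A frequency list of this kind can be produced directly from a segmented corpus. The sketch below assumes a whitespace-tokenised, one-sentence-per-line UTF-8 file; the file name is hypothetical.

  from collections import Counter

  def frequency_list(path):
      """Count token frequencies in a whitespace-tokenised corpus file."""
      counts = Counter()
      with open(path, encoding="utf-8") as f:
          for line in f:
              counts.update(line.split())
      return counts

  if __name__ == "__main__":
      freq = frequency_list("corpus_segmented.txt")
      for word, count in freq.most_common(20):
          print(f"{count}\t{word}")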

The resources and tools downloadable from this page all use UTF-8. They have been designed to work with the simplified script, though some provisions for the traditional script have been added as well.

The resources have been developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries. The tokeniser is based on the original code developed by Erik Peterson, Mandarin Tools; advice on tagging was provided by Martin Thomas and Daming Wu. The tools are provided under the GNU General Public License.