This paper discusses the compilation of the words.hk Cantonese dictionary dataset, which was built through manual annotation over a period of 7 years. Cantonese is a low-resource language with few tagged or manually checked resources, especially at the sentential level, and this dataset is an attempt to fill the gap. The dataset contains over 53,000 entries of Cantonese words, which come with basic lexical information (Jyutping phonemic transcription, part-of-speech tags, usage tags), manually crafted definitions in Written Cantonese, English translations, and Cantonese examples with English translations and Jyutping transliterations. Special attention has been paid to handling character variants, so that unintended “character errors” (equivalent to typos in phonemic writing systems) are filtered out and intra-speaker variants are handled. Fine details of word segmentation, character variant handling, and definition crafting are discussed. The dataset can be used in a wide range of natural language processing tasks, such as word segmentation, construction of a semantic web, and training of models for Cantonese transliteration.
Copyright © 2022 European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
|Title of host publication||Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) @LREC2022|
|Place of Publication||France|
|Publisher||European Language Resources Association|
|Publication status||Published - Jun 2022|
Citation: Lau, C. M., Chan, G. W.-Y., Tse, R. K.-W., & Chan, L. S.-Y. (2022). Words.hk: A comprehensive Cantonese dictionary dataset with definitions, translations and transliterated examples. In Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) @LREC2022 (pp. 53-62). France: European Language Resources Association.
Keywords
- Cantonese dictionary
- Parts of speech
- Word segmentation
- Character variants
- Semantic web