Words.hk: A comprehensive Cantonese dictionary dataset with definitions, translations and transliterated examples

Chaak Ming LAU, Grace Wing-yan CHAN, Raymond Ka-wai TSE, Lilian Suet-ying CHAN

Research output: Chapter in Book/Report/Conference proceeding (Chapters)

Abstract

This paper discusses the compilation of the words.hk Cantonese dictionary dataset, which was compiled through manual annotation over a period of 7 years. Cantonese is a low-resource language with limited tagged or manually checked resources, especially at the sentential level, and this dataset is an attempt to fill the gap. The dataset contains over 53,000 entries of Cantonese words, which come with basic lexical information (Jyutping phonemic transcription, part-of-speech tags, usage tags), manually crafted definitions in Written Cantonese, English translations, and Cantonese examples with English translations and Jyutping transliterations. Special attention has been paid to the handling of character variants, so that unintended “character errors” (equivalent to typos in phonemic writing systems) are filtered out, and intra-speaker variants are handled. Fine details on word segmentation, character variant handling, and definition crafting are discussed. The dataset can be used in a wide range of natural language processing tasks, such as word segmentation, construction of a semantic web, and training of models for Cantonese transliteration. Copyright © 2022 European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
Original language: English
Title of host publication: Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) @LREC2022
Place of Publication: France
Publisher: European Language Resources Association
Pages: 53-62
Publication status: Published - Jun 2022

Citation

Lau, C. M., Chan, G. W.-Y., Tse, R. K.-W., & Chan, L. S.-Y. (2022). Words.hk: A comprehensive Cantonese dictionary dataset with definitions, translations and transliterated examples. In Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) @LREC2022 (pp. 53-62). France: European Language Resources Association.

Keywords

  • Cantonese dictionary
  • Diglossia
  • Corpora
  • Jyutping
  • Parts of speech
  • Word segmentation
  • Character variants
  • Semantic web
  • Crowdsourcing
