PyCantonese: Cantonese linguistics and NLP in Python

Jackson L. LEE, Litong CHEN, Charles LAM, Chaak Ming LAU, Tsz-Him TSUI

Research output: Chapter in Book/Report/Conference proceeding > Chapters

7 Citations (Scopus)

Abstract

This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. It first describes the library's design, implementation, corpus data format, and the key datasets it includes, then gives an overview of the currently implemented functionality: stop words, handling of Jyutping romanization, word segmentation, part-of-speech tagging, and parsing of Cantonese text. Copyright © 2022 European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
Original language: English
Title of host publication: Proceedings of the 13th Language Resources and Evaluation Conference
Place of publication: France
Publisher: European Language Resources Association
Pages: 6607-6611
Publication status: Published - Jun 2022

Citation

Lee, J. L., Chen, L., Lam, C., Lau, C. M., & Tsui, T.-H. (2022). PyCantonese: Cantonese linguistics and NLP in Python. In Proceedings of the 13th Language Resources and Evaluation Conference (pp. 6607-6611). France: European Language Resources Association.

Keywords

  • Cantonese
  • Jyutping
  • Word segmentation
  • Part-of-speech tagging
  • Stop words
